Purpose: To examine theoretical aspects of relational databases (and their extensions) suggested by
computer science considerations.

Accomplishments: Two distinct topics were investigated during the grant period. Each was
studied by one of the above research assistants and is to become an integral part of his PhD thesis.
A brief summary of the results obtained is now given. Enclosed with this review are two reports
describing the results in more detail.

The first report, “Properties of Spreadsheet Histories”, is by Stephen Kurtzman. In this report,
some theoretical aspects of one of the most widely used types of small-business data-processing
software, namely the spreadsheet program, are studied. In particular, a formal model for spreadsheet
histories is presented and examined with respect to two questions. First, the database operations
of selection, projection, cohesion, intersection and union are considered. The primary question here
is whether or not these operations preserve the formal model. The answer is yes for selection,
cohesion and intersection, and no for the remaining two. A necessary and sufficient condition is
then given for projection and union to preserve the basic model. The second question concerns a
notion of spreadsheet-history equivalence based on the projection operation. This concept, projection
simulation, is defined, and three sufficient conditions are presented for when it preserves an
important subclass, the “history bounded”, of the data model.

The  second  report,  “Declarative  Sequence  Operations  and  Their  Usage  in  Query  Languages”,  is 
by  Xiaoyang  Wang.  In  this  report,  a  family  of  declarative  sequence  operations  (based  on  regular 
expressions  from  formal  language  theory)  is  introduced  and  studied.  The  operations  were  chosen 
for  their  power  and  simplicity  in  describing  the  access  of  sequential  information.  There  are  two 
parts  to  the  investigation.  The  first  part  is  devoted  to  the  study  of  these  sequence  operations.  In 
particular,  they  are  characterized  by  a  type  of  automaton.  The  decomposition  of  operations  is  then 
considered  and  an  infinite  non-collapsing  hierarchy  of  sequence  operations  established.  The  second 
part  of  the  report  introduces  both  a  database  model  with  sequences  as  a  basic  data  construct,  and 
a  query  language  defined  in  terms  of  the  sequence  operations.  Numerous  examples  are  given  to 
demonstrate  the  practical  utility  of  the  sequence  operations. 
Declarative Sequence Operations and
Their Usage in Query Languages*

Xiaoyang  Wang 
Computer  Science  Department 
University  of  Southern  California 
Los  Angeles,  CA  90089-0782 


Abstract 

A family of declarative sequence operations based on regular expressions is
introduced and their properties studied. Using these operations, an extension of SQL
is presented over a simple database model with sequence constructors.


1  Introduction 


In  a  typical  complex-object  database  system,  tuple,  set  and  sequence  (or  list)  are  the  three 
major  “bulk”  data  constructors  [3,  5,  15].  While  tuples  and  sets  have  been  extensively 
studied  [1,  2,  4,  9,  14],  very  few  investigations  about  sequences  in  databases  have  been 
reported,  and  notably,  no  declarative  query  system  on  sequences  has  ever  been  proposed 
in  the  literature.  The  purpose  of  this  paper  is  to  introduce  such  a  system  using  regular 
expressions  as  declarative  operations  over  sequences. 

The  basic  idea  of  this  paper  is  to  use  the  regular  expressions  as  patterns.  One  can 
view  the  patterns  as  describing  the  ways  of  merging  several  sequences  or  of  selecting 
a  subsequence  from  one  sequence.  Regular  expressions  can  be  used  to  describe  many 
natural  patterns  and  in  a  simple  way.  This  fact  leads  to  powerful,  yet  simple,  declarative 
languages  on  databases  with  sequences.  As  an  example  of  using  the  sequence  operations 
in  query  languages,  a  simple  data  model  involving  sequences  and  its  query  language  are 

*This  report  summarizes  the  work  supported  by  the  Air  Force  Office  of  Scientific  Research  (AFOSR) 
grant  89-0244.  A  more  detailed  presentation,  including  formal  proofs,  will  be  provided  in  the  author’s 
Ph.D.  thesis. 
defined in this paper. The data model is basically a relational one but with every entry
of a tuple being a sequence of (zero or more) (basic) values. This is an overly simplified
complex-object database model (which contains neither sets nor nested structures). The
simplicity of the model permits one to focus attention on sequences. Nevertheless, the use
of the sequence operations in the query language over this simple model is quite general.
The sequence operations can also be used in query languages over more complicated data
models.

The rest of the paper is organized as follows. In Section 2, the motivation of the
work is presented. In Section 3, the sequence operations based on regular expressions are
formally defined. Also in Section 3, a type of a-transducer is defined to characterize the
operations based on regular expressions and to serve as “operational semantics” of the
operations. Some properties of the operations are studied in Section 3. The data model
mentioned above is defined in Section 4. In Section 5, an extension of SQL, SSQL or
Structured Sequence Query Language, is proposed using the sequence operations described
in Section 3. We conclude the paper with some remarks in Section 6.

2  Background  and  Motivations 

Sequence  constructors  are  used  in  applications  where  order  is  significant.  For  example, 
consider  the  tour  schedules  in  Figure  1,  where  the  order  of  the  cities  being  visited  in  a 
tour  is  as  displayed  (thus,  the  cities  are  not  simply  in  a  set).  The  following  are  two  natural 
queries: 

(1)  Find  all  tours  whose  second  city  is  Atlanta;  and 

(2)  Find  all  tours  visiting  only  the  cities  in  Tour  356,  and  in  the  same  order. 

To  handle  these  queries,  at  least  two  methods  in  the  literature  can  be  used. 

One  method  is  to  view  Figure  1  as  a  nested  relation  with  arrival  and  departure  dates 
as  time  stamps  (see,  for  example,  [6,  10,  18]  and,  to  some  extent,  [7]).  The  order  of  the 
cities  in  a  tour  is  implicit  here,  since  the  order  is  according  to  the  arrival  dates  of  the 
cities.  Using  the  implicit  orders,  two  sub-phrases  may  be  necessary  to  express  query  (1): 
one  to  make  sure  that  Atlanta  appears  in  the  tour,  and  another  to  see  that  there  is  one 
and  only  one  city  before  it.  This  is  a  complicated  way  of  dealing  with  such  a  simple  query. 

Figure 1: Tour schedules.

Another method is to directly view the cities as sequences, i.e., the order of the cities
is the order in which they appear in the sequence. In this method, the sequence form permits the
use of indices in expressing queries. For example, in the query language of O2 [5], query
(1) above can simply be written as:

Select x.TOUR_NO
From x in Tour_schedule
Where x.City[2] = “Atlanta”

However, there is no natural way of expressing query (2) using only indices, since query
(2) involves the question of whether one sequence is a subsequence of another.

In this paper, we use explicit sequences and employ regular expressions as operations
on sequences. Both queries (1) and (2) can then be easily written using the operations.
Intuitively, a regular expression denotes a set of patterns describing how to merge several
sequences into a new one. For example, $(x_1x_2)^*$ represents patterns indicating that the
elements of odd numbered positions and the elements of even numbered positions of a
merged sequence are from different sources ($x_1$ and $x_2$, respectively). Because of this, we
use $[\![(x_1x_2)^*]\!]_2$ as a binary merging operation. For each pair of sequences of equal length
$u_1$ and $u_2$, $[\![(x_1x_2)^*]\!]_2(u_1, u_2)$ is the “perfect shuffle” of the pair. See Figure 2 below.

Figure 2: The merging operation $[\![(x_1x_2)^*]\!]_2$ on abc and def.

Similarly, we can match the patterns described by a regular expression against a single
sequence and extract subsequences from some target positions. For example, we can
use $(x_1 \cup x_2)^*$ to match against sequences and extract the elements from the positions
$x_1$ occupies. We use $[\![(x_1 \cup x_2)^*]\!]_2^{-1}$ to denote such an operation (where the superscript $-1$
means to extract the elements from the positions of $x_1$). See Figure 3 for an example.
Obviously, $[\![(x_1 \cup x_2)^*]\!]_2^{-1}(u)$ returns all subsequences of u.
[Figure 3 panel labels: match against the pattern; a pattern in $(x_1 \cup x_2)^*$; extract according to $x_1$.]

Figure 3: The selecting operation $[\![(x_1 \cup x_2)^*]\!]_2^{-1}$ on abcdef.
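
The two informal operations of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sequences are modeled as Python strings, and the function names `perfect_shuffle` and `subsequences` are hypothetical, not part of the report.

```python
from itertools import combinations

def perfect_shuffle(u1, u2):
    """Merge two sequences by the pattern (x1 x2)*: odd positions from u1, even from u2."""
    if len(u1) != len(u2):
        return None  # no word of (x1 x2)* uses x1 and x2 an unequal number of times
    return "".join(a + b for a, b in zip(u1, u2))

def subsequences(u):
    """All subsequences of u, i.e. every way of marking positions as x1 and extracting them."""
    return {
        "".join(u[i] for i in positions)
        for k in range(len(u) + 1)
        for positions in combinations(range(len(u)), k)
    }
```

For instance, `perfect_shuffle("abc", "def")` interleaves the two inputs, and `subsequences("abcdef")` contains "ace" among its 64 members.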

Utilizing a selecting operation, query (2) can easily be expressed as follows in SSQL
(defined in Section 5):

Select Tour_NO
From Tour_schedule
Where City in $[\![(x_1 \cup x_2)^*]\!]_2^{-1}$(Select CITY
                                        From Tour_schedule
                                        Where Tour_NO = 356)

in which the where clause tests whether a sequence is a subsequence of another one.
Also, for each positive integer i, we may use u[i] as a shorthand for $[\![x_1^{i-1}x_2x_1^*]\!]_2^{-2}(u)$, i.e.,
extracting the i-th element of u. We can then express query (1) as easily as the query
language of O2 does.
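
As a rough sketch of what queries (1) and (2) compute, the following Python fragment evaluates both over a small tour table. The table contents are hypothetical (the data of Figure 1 is not reproduced here), and `is_subsequence` and `element` are illustrative helper names standing in for the selecting operation and the shorthand u[i].

```python
def is_subsequence(u, w):
    """True if u can be obtained from w by deleting elements (order preserved)."""
    remaining = iter(w)
    return all(x in remaining for x in u)

def element(u, i):
    """The i-th element of u (1-based), or None if u is too short."""
    return u[i - 1] if 1 <= i <= len(u) else None

# Hypothetical tour schedule: tour number -> sequence of cities.
tours = {
    356: ("Los Angeles", "Atlanta", "Boston", "Chicago"),
    101: ("Los Angeles", "Boston"),
    202: ("Denver", "Atlanta"),
}

# Query (1): tours whose second city is Atlanta.
query1 = sorted(no for no, cities in tours.items() if element(cities, 2) == "Atlanta")

# Query (2): tours visiting only cities of Tour 356, in the same order.
query2 = sorted(no for no, cities in tours.items() if is_subsequence(cities, tours[356]))
```

Note that Tour 356 trivially satisfies query (2), since every sequence is a subsequence of itself.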


3  Sequence  Operations 


In this section, we formally define a family of operations over sequences and study their
properties. Throughout this section, we assume an infinite alphabet $\Sigma_\infty$, i.e., an infinite
set of abstract elements, and a countably infinite set of variables $V_\infty$, with $V_\infty \cap \Sigma_\infty = \emptyset$.

3.1  Regular  Expressions  as  Sequence  Operations 

In  order  to  define  sequence  operations  based  on  regular  expressions,  the  following  notion 
is  needed. 

Notation Let x be a variable in $V_\infty$ and C a finite subset of $\Sigma_\infty$. Then $\mathcal{P}(x, C)$ is the
set of boolean formulas with $x = c$ and $x \ne c$, where c is in C, as atomic ones.

A sequence u is said to satisfy p in $\mathcal{P}(x, C)$ if and only if each element of u satisfies
p. The empty sequence $\varepsilon$ satisfies all conditions. Let $V_n = \{x_1, \ldots, x_n\}$ be a set of n
variables. Then $\mathcal{P}(V_n, C) = \bigcup_{x \in V_n} \mathcal{P}(x, C)$. Let P be a subset of $\mathcal{P}(V_n, C)$. Then $P(x_i)$
denotes the formula $(p_{i1}) \wedge \cdots \wedge (p_{ik})$ where $\{p_{i1}, \ldots, p_{ik}\} = P \cap \mathcal{P}(x_i, C)$.

To define the sequence operations, we also borrow the notions of projection $\pi$ and
selection $\sigma$ from relational algebra. In particular, let (1) $\pi_i(u_1 \ldots u_n) = v_1 \ldots v_n$ where $v_j$
is the projection on the i-th column (in the relational algebra sense) of (the tuple) $u_j$ for
each $1 \le j \le n$, and (2) $\sigma_\varphi(u)$ be the maximum subsequence of u whose elements satisfy
the condition $\varphi$. We are now ready to define the (sequence) merging operations.

Definition 1 Let $n \ge 1$ be a positive integer, C a finite subset of $\Sigma_\infty$, Q a regular
expression over $V_n = \{x_1, \ldots, x_n\}$ and P a subset of $\mathcal{P}(V_n, C)$. Then $[\![Q,P]\!]_n$ is an n-ary sequence
operation, called an n-ary (sequence) merging operation, such that for all subsets $L_1, \ldots, L_n$
of $\Sigma_\infty^*$,

$[\![Q,P]\!]_n(L_1, \ldots, L_n) = \{u \mid$ there exists v in $(\Sigma_\infty \times V_n)^*$ such that $u = \pi_1(v)$, $\pi_2(v)$ is
in $L(Q)$ and, for each $1 \le i \le n$, $\pi_1\sigma_{2=x_i}(v)$ is in $L_i$ and satisfies $P(x_i)\}$.

Thus, an n-ary merging operation defines a mapping from n sets of sequences (over
$\Sigma_\infty$) to a single set of sequences (over $\Sigma_\infty$). In a merging operation $[\![Q,P]\!]_n$, Q describes
patterns used to select elements from corresponding input sequences to form merged
ones, and P acts as a filter that allows only certain input sequences to participate in the
merging. (When P is empty, it is usually omitted.)


Example Consider $[\![(x_1x_2)^*, (x_2 \ne f)]\!]_2(\{ab\}, \{cd, ef\})$. Since ef does not satisfy
$P(x_2) = (x_2 \ne f)$, ef is excluded from the merging. Therefore, only ab and cd are
going to be merged according to the patterns described by $(x_1x_2)^*$. Since $x_1x_2x_1x_2$ is a
word in $L((x_1x_2)^*)$, the sequence acbd is obtained. It is easy to see that acbd is the only
result. Hence, $[\![(x_1x_2)^*, (x_2 \ne f)]\!]_2(\{ab\}, \{cd, ef\}) = \{acbd\}$.
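
Definition 1 can be made concrete with a brute-force Python sketch. It is not the report's construction, only an illustration under these assumptions: sequences are strings, variable $x_i$ is encoded as the digit i (so the sketch handles at most nine variables), Q is written as a Python regular expression over those digits (with `|` for union), and P is given as a hypothetical dictionary `preds` mapping a variable index to a per-element predicate.

```python
import re
from itertools import product

def interleavings(counts):
    """Yield every word over digits 1..n that uses digit i exactly counts[i-1] times."""
    if all(c == 0 for c in counts):
        yield ""
        return
    for i, c in enumerate(counts):
        if c > 0:
            rest = counts[:i] + (c - 1,) + counts[i + 1:]
            for tail in interleavings(rest):
                yield str(i + 1) + tail

def satisfies(seqs, preds):
    """Every element of the i-th input must satisfy the predicate attached to x_i."""
    return all(pred(e) for i, pred in preds.items() for e in seqs[i - 1])

def merge(pattern, langs, preds=None):
    """Brute-force n-ary merging operation for pattern Q and filter P (as preds)."""
    preds = preds or {}
    out = set()
    for seqs in product(*langs):          # one sequence from each input set
        if not satisfies(seqs, preds):
            continue
        for word in interleavings(tuple(len(s) for s in seqs)):
            if re.fullmatch(pattern, word):
                sources = [iter(s) for s in seqs]
                out.add("".join(next(sources[int(v) - 1]) for v in word))
    return out
```

With this encoding, the example above is `merge("(12)*", [{"ab"}, {"cd", "ef"}], {2: lambda e: e != "f"})`. The enumeration is exponential and is meant only to mirror the definition, not to be efficient.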

Example It is easy to see that $[\![x_1^*x_2^*]\!]_2(L_1, L_2)$ returns the concatenation of $L_1$ and $L_2$,
i.e., $[\![x_1^*x_2^*]\!]_2(L_1, L_2)$ is the set of all sequences of the form uv where u is in $L_1$ and v is in
$L_2$.

Regular  expressions  can  also  be  used  to  extract  out  subsequences  from  given  sequences 
as  follows. 

Definition 2 Let $[\![Q,P]\!]_n$ be a sequence merging operation and I a subset of $\{1, \ldots, n\}$.
Then $[\![Q,P]\!]_n^{-I}$ is a unary sequence operation, called an n-ary (sequence) selecting operation,
such that for each subset L of $\Sigma_\infty^*$,

$[\![Q,P]\!]_n^{-I}(L) = \{u \mid$ there exists v in $(\Sigma_\infty \times V_n)^*$ such that $\pi_1(v)$ is in L, $\pi_2(v)$ is in
$L(Q)$, $u = \pi_1\sigma_{2 \in X_I}(v)$ and, for each $1 \le i \le n$, $\pi_1\sigma_{2=x_i}(v)$ satisfies $P(x_i)\}$

where $X_I = \{x_i \mid i \text{ in } I\}$.

Thus,  an  n-ary  selecting  operation  defines  a  mapping  from  a  set  of  sequences  to  another 
set  of  sequences.  When  I  consists  of  only  a  single  element,  we  will  write  the  element 
instead  of  the  set  consisting  of  that  element. 

Intuitively, in a selecting operation $[\![Q,P]\!]_n^{-I}$, Q defines a set of patterns to extract
subsequences from designated (by I) positions of input sequences, and P describes the
conditions certain positions of the input sequences must satisfy.

Example Consider $[\![x_1^*x_2x_3^*, (x_2 = b)]\!]_3^{-3}(\{abcd\})$. Now (i) $x_1x_2x_3x_3$ is in $L(x_1^*x_2x_3^*)$, (ii)
abcd is an input sequence, and (iii) b satisfies $P(x_2) = (x_2 = b)$ where b is the element
in abcd corresponding to $x_2$ in $x_1x_2x_3x_3$. Hence, elements in abcd corresponding to $x_3$ in
$x_1x_2x_3x_3$ are extracted. Therefore, cd is one of the selecting results. It is easy to see that
it is the only result. Thus, $[\![x_1^*x_2x_3^*, (x_2 = b)]\!]_3^{-3}(\{abcd\}) = \{cd\}$.
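
A selecting operation in the sense of Definition 2 can be sketched the same way. The following is a brute-force illustration, not the report's method, under the same assumed encoding: variable $x_i$ is the digit i, Q is a Python regular expression over digits, `extract` plays the role of the index set I, and `preds` is a hypothetical dictionary of per-element predicates standing in for P.

```python
import re
from itertools import product

def select(pattern, n, extract, lang, preds=None):
    """Brute-force n-ary selecting operation: label each position of an input
    sequence with a variable so that the labels spell a word of L(Q), check the
    predicates, and keep the elements whose label is in `extract`."""
    preds = preds or {}
    always = lambda e: True
    out = set()
    for u in lang:
        for labels in product("123456789"[:n], repeat=len(u)):
            word = "".join(labels)
            if not re.fullmatch(pattern, word):
                continue
            if all(preds.get(int(v), always)(e) for v, e in zip(word, u)):
                out.add("".join(e for v, e in zip(word, u) if int(v) in extract))
    return out
```

The example above becomes `select("1*23*", 3, {3}, {"abcd"}, {2: lambda e: e == "b"})`, and the shorthand u[2] corresponds to `select("121*", 2, {2}, {u})`.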

Example Let k be a positive integer and L a subset of $\Sigma_\infty^*$. Then $[\![x_1^kx_2^*]\!]_2^{-1}(L) =
\mathrm{Prefix}_k(L)$ and $[\![x_1^*x_2^k]\!]_2^{-2}(L) = \mathrm{Suffix}_k(L)$.


Consider  a  composition  of  merging  and  selecting  operations  in  the  following  example. 

Example  Let  F(L)  =  I(x1a;2)*322(S(®i*2)*l2(I[(*i*2)*]l21(-£,)>  (£)))•  K  L  consists 

of an even length word w, then F(L) returns the set consisting of the first half of w. For
example, F({abcd}) = {ab}. Notice that two variables are used in defining F. The same
operation cannot be realized with only one variable. Later we will see that, in general,
additional mappings can be realized if more variables are used.


3.2  Generic  a-Transducers 


We  next  define  a  type  of  a-transducer  which  gives  an  “operational  semantics”  to  the 
sequence  operations  based  on  regular  expressions.  We  now  formally  define  the  device. 

Definition 3 A generic n-tape sequential transducer with accepting states, abbreviated
generic n-tape a-transducer, is a 7-tuple $M_n = (n, C, K, x, H, p_0, F)$, where

(1) n is a positive integer.

(2) C is a finite subset of $\Sigma_\infty$ (the constants).

(3) K is a finite set (of states).

(4) x is not in C (the variable).

(5) H is a subset of $K \times (C \cup \{x\}) \times \{1, \ldots, n\} \times K \times (C \cup \{x, \varepsilon\})$ such that
for each $(p_1, a, t, p_2, b)$ in H, either $b = a$ or $b = \varepsilon$.

(6) $p_0$ is in K (the start state).

(7) $F \subseteq K$ (the set of accepting states).

The variable symbol x in a generic n-tape a-transducer acts as a placeholder. Whenever
a symbol not in C is seen by the device, it uses a transition rule containing x.
Therefore, a generic a-transducer is like a pattern or a schema. The behavior of a generic
a-transducer is formally defined in the following. First though, we need some auxiliary
notions. We will use $\hat{a}$ to denote $a$ or $\varepsilon$ in the remainder of this section.

Notation Let $M_n = (n, C, K, x, H, p_0, F)$ be a generic n-tape a-transducer and A a set.
For each $h = (p_1, x, t, p_2, \hat{x})$ in H and a in A, let $h[a] = (p_1, a, t, p_2, \hat{a})$ such that $\hat{a} = \varepsilon$ if
and only if $\hat{x} = \varepsilon$. Let¹ $H[A] = \{h[a] \mid h \text{ in } H \text{ and } a \text{ in } A\}$.


¹If h does not contain x, h[a] = h for each a.


Notation Let $M_n = (n, C, K, x, H, p_0, F)$ be a generic n-tape a-transducer. Define $\vdash_{M_n}$
(or $\vdash$ when $M_n$ is understood) to be the relation on $K \times \Sigma_\infty^* \times \cdots \times \Sigma_\infty^*$ ($\Sigma_\infty^*$ appears
$n + 1$ times) by letting

$(p_1, w_1, \ldots, a w_t, \ldots, w_n, v) \vdash (p_2, w_1, \ldots, w_t, \ldots, w_n, vb)$

if $(p_1, a, t, p_2, b)$ is in $H[\Sigma_\infty - C]$. Let $\vdash^*$ be the reflexive, transitive closure of $\vdash$. Let $\vdash^k$
be the relation $(\vdash)^k$ for $k \ge 0$.

We  are  now  ready  to  define  the  mapping  performed  by  a  generic  n-tape  a-transducer. 

Definition 4 Let $M_n = (n, C, K, x, H, p_0, F)$ be a generic n-tape a-transducer and $L_1, \ldots, L_n$ be
sets of sequences over $\Sigma_\infty$. Then $M_n(L_1, \ldots, L_n) = \{w \mid (p_0, v_1, \ldots, v_n, \varepsilon) \vdash^* (p, \varepsilon, \ldots, \varepsilon, w)$ for
some p in F and $v_i$ in $L_i$ for all $1 \le i \le n\}$.
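
The behavior described in Definitions 3 and 4 can be simulated directly. The sketch below is an illustration under assumptions, not the report's formalism verbatim: sequences are strings, transitions are 5-tuples $(p_1, a, t, p_2, b)$, and the sentinels `VAR` and `EPS` (hypothetical names) stand for the variable x and for $\varepsilon$. A rule containing `VAR` fires only on symbols outside the constant set C.

```python
from itertools import product

VAR = object()  # the placeholder variable x
EPS = object()  # the empty output

def run(transitions, start, accepting, constants, tapes):
    """All outputs the transducer can produce on one tuple of input tapes."""
    results, seen = set(), set()
    stack = [(start, tuple(tapes), "")]
    while stack:
        config = stack.pop()
        if config in seen:
            continue
        seen.add(config)
        state, rest, out = config
        if state in accepting and all(t == "" for t in rest):
            results.add(out)
        for (p1, a, t, p2, b) in transitions:
            tape = rest[t - 1]
            if p1 != state or not tape:
                continue
            head = tape[0]
            if a is VAR:
                if head in constants:   # x only stands for non-constants
                    continue
            elif a != head:
                continue
            emitted = "" if b is EPS else head  # b is either a (copy) or epsilon
            new_rest = rest[:t - 1] + (tape[1:],) + rest[t:]
            stack.append((p2, new_rest, out + emitted))
    return results

def transduce(transitions, start, accepting, constants, langs):
    """M(L1, ..., Ln): union of run() over every choice of one sequence per set."""
    out = set()
    for tapes in product(*langs):
        out |= run(transitions, start, accepting, constants, tapes)
    return out

# A 2-tape transducer that alternates between the tapes (the perfect shuffle).
shuffle = [("q0", VAR, 1, "q1", VAR), ("q1", VAR, 2, "q0", VAR)]
# A 1-tape transducer that copies or skips each symbol (all subsequences).
subseq = [("q", VAR, 1, "q", VAR), ("q", VAR, 1, "q", EPS)]
```

Every step consumes one input symbol, so the search always terminates.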

The next result shows that the “composition” of generic a-transducers is also a generic
a-transducer. First though, we define a subclass of generic a-transducers.

Definition 5 A generic n-tape a-transducer $M_n = (n, C, K, x, H, p_0, F)$ is $\varepsilon$-free if, for
each $(p_1, a, t, p_2, b)$ in H, $b = a$. $M_n$ is uniform if, for each pair $(p, a, t, p', a)$ and $(q, b, t, q', b)$
in H, $(p, b, t, p', b)$ is also in H.
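
The two properties of Definition 5 are simple conditions on the transition set and can be checked mechanically. In this sketch, which rests on an assumed encoding, a transition is a 5-tuple of plain Python values, with the string "x" standing for the variable and the empty string for $\varepsilon$; the function names are hypothetical.

```python
def is_eps_free(H):
    """Every transition copies its input symbol to the output (b = a)."""
    return all(b == a for (_, a, _, _, b) in H)

def is_uniform(H):
    """For every pair of copying transitions (p, a, t, p', a) and (q, b, t, q', b)
    on the same tape t, the transition (p, b, t, p', b) must also be present."""
    H = set(H)
    copying = [h for h in H if h[4] == h[1]]
    for (p, _a, t, p_next, _) in copying:
        for (_q, b, t_other, _q_next, _) in copying:
            if t == t_other and (p, b, t, p_next, b) not in H:
                return False
    return True

shuffle = [("q0", "x", 1, "q1", "x"), ("q1", "x", 2, "q0", "x")]
mixed = [("q0", "a", 1, "q1", "a"), ("q0", "x", 1, "q0", "x")]
skipping = [("q", "x", 1, "q", "x"), ("q", "x", 1, "q", "")]
```

Here `shuffle` is both $\varepsilon$-free and uniform, `mixed` is $\varepsilon$-free but not uniform (it lacks the transition from q0 to q1 on x), and `skipping` is not $\varepsilon$-free.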

Proposition 1 Let $k \ge 1$ be an integer, $M_i$ be a generic $n_i$-tape a-transducer for each
$1 \le i \le k$ and $M_0$ a generic $k'$-tape a-transducer where $k' \ge k$. Then there is a generic
$(\sum_{i=1}^{k} n_i + k' - k)$-tape a-transducer M such that for all subsets $L_{ij}$ ($1 \le i \le k$, $1 \le j \le n_i$)
of $\Sigma_\infty^*$, and subsets $L_l$ ($k + 1 \le l \le k'$) of $\Sigma_\infty^*$,
$M(L_{11}, \ldots, L_{1n_1}, \ldots, L_{k1}, \ldots, L_{kn_k}, L_{k+1}, \ldots, L_{k'})$
$= M_0(M_1(L_{11}, \ldots, L_{1n_1}), \ldots, M_k(L_{k1}, \ldots, L_{kn_k}), L_{k+1}, \ldots, L_{k'})$. Furthermore, M is $\varepsilon$-free
(uniform) if each $M_i$ ($0 \le i \le k$) is $\varepsilon$-free (uniform).

3.3 Characterizations and Properties of the Sequence Operations

In this subsection, we characterize the sequence operations in terms of the generic
a-transducers and study several properties of the operations. The first result shows that
$\varepsilon$-free uniform n-tape a-transducers have the same expressive power as the n-ary merging
operations. Notice that a mapping from $\Sigma_\infty^* \times \cdots \times \Sigma_\infty^*$ ($\Sigma_\infty^*$ appears n times) to $2^{\Sigma_\infty^*}$ is
called an n-ary sequence operation.

Theorem 1 Let F be a sequence operation. Then F is equivalent to an n-ary merging
operation if and only if it is equivalent to an n-tape $\varepsilon$-free uniform generic a-transducer.

It is easy to see that not every a-transducer is equivalent to an $\varepsilon$-free uniform one.
Therefore, the above theorem suggests that there exists an a-transducer which has no
equivalent merging operation. There are examples which show that this is indeed the case.
Thus, we have:


Theorem  2  For  each  n  >  1,  there  are  n-ary  a-transducers  which  have  no  equivalent 
merging  operations. 


Theorem 1 also shows that each merging operation has an equivalent uniform $\varepsilon$-free
a-transducer. Therefore, by Theorem 1 and Proposition 1, the following corollary holds.

Corollary 1 Let $[\![Q_0, P_0]\!]_k$ be a k-ary merging operation and $[\![Q_i, P_i]\!]_{n_i}$ be $n_i$-ary merging
operations for each $1 \le i \le k$. Then there exists a $(\sum n_i)$-ary merging operation $[\![Q, P]\!]_n$
such that $[\![Q, P]\!]_n(L_{11}, \ldots, L_{1n_1}, \ldots, L_{k1}, \ldots, L_{kn_k}) = [\![Q_0, P_0]\!]_k([\![Q_1, P_1]\!]_{n_1}(L_{11}, \ldots, L_{1n_1}),
\ldots, [\![Q_k, P_k]\!]_{n_k}(L_{k1}, \ldots, L_{kn_k}))$.


The next theorem shows that the selecting operations are equivalent to unary
a-transducers (not necessarily uniform or $\varepsilon$-free ones).

Theorem 3 Let $F_1$ be a unary sequence operation. Then $F_1$ is equivalent to a generic
1-tape a-transducer if and only if it is equivalent to a selecting operation.

By  the  above  theorem  and  Proposition  1 ,  it  is  easily  seen  that  the  selecting  operations 
are  closed  under  composition. 

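Corollary 2 Let $[\![Q_1, P_1]\!]_{n_1}^{-I_1}$ and $[\![Q_2, P_2]\!]_{n_2}^{-I_2}$ be two selecting operations. Then there is
a $[\![Q_3, P_3]\!]_{n_3}^{-I_3}$ such that $[\![Q_3, P_3]\!]_{n_3}^{-I_3}(L) = [\![Q_1, P_1]\!]_{n_1}^{-I_1}([\![Q_2, P_2]\!]_{n_2}^{-I_2}(L))$ for all L.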


We now consider the following question: Can one define additional sequence operations
by using more variables? The answer is positive. Instead of giving examples of the
sequence operations realizable using n variables but not n − 1 variables, we first
characterize when a merging operation is decomposable, i.e., is a composition of some other
merging operations. Obviously, a merging operation can be realized by using fewer
variables if and only if it is decomposable.

Notation Let V be a subset of $V_n$, w and u sequences over $V_n$ and Q a regular expression
such that $L(Q) = \{w\}$. Then $w[V/\!/u] = \{w_1 \mid [\![Q]\!]_n^{-I_V}(w_1) = u$ and $[\![Q]\!]_n^{-I_{V_n - V}}(w_1) = w'\}$
where $I_V = \{i \mid x_i \in V\}$ and $I_{V_n - V} = \{i \mid x_i \in V_n - V\}$. Given a sequence w over $V_n$ and
a subset V of $V_n$, let Q be a regular expression such that $L(Q) = \{w\}$. Then $w|V = [\![Q]\!]_n^{-I_V}(w)$
where $I_V = \{i \mid x_i \in V\}$. Let L be a subset of $V_n^*$. Then $L|V = \bigcup_{w \in L} w|V$.

Definition 6 Let Q be a regular expression over $V_n$. A subset V of $V_n$ is said to be
independent in $V_n$ with respect to Q if $L(Q) = \bigcup_{w \in L(Q),\ u \in L(Q)|V} w[V/\!/u]$.

Theorem 4 Let $[\![Q,P]\!]_n$ be an n-ary merging operation. Then $[\![Q,P]\!]_n$ is decomposable
if and only if there is a proper subset V, with #(V) > 1, of $V_n$ such that V is independent
in $V_n$ with respect to Q.

To see whether a merging operation $[\![Q,P]\!]_n$ is not decomposable, we thus only need to
test that each proper subset of $V_n$ of arity greater than 1 is not independent with respect
to Q. For example, $[\![(x_1x_2 \cup x_2x_3)^*]\!]_3$ is easily seen not to be decomposable. It is also easy to
see that for each n > 2, $[\![(x_1x_2 \cup x_2x_3 \cup \cdots \cup x_{n-1}x_n)^*]\!]_n$ is not decomposable. Hence, we
have the following result.

Theorem 5 For each n > 2, there is an n-ary merging operation which is not decomposable.

Let $[\![Q,P]\!]_n^{-I}$ be a selecting operation. If P is empty, then it is easily seen that there
is a regular expression $Q_1$ over $V_2$ such that $[\![Q_1]\!]_2^{-1}$ is equivalent to $[\![Q,P]\!]_n^{-I}$. If P is not
empty, then we can see that the more variables used, the more mappings can be realized.

4  Tuple  Sequence  Database  Model 

In this section, a data model is described which is basically relational, but uses sequences
of basic values as entries.


In the following, U is assumed to be a non-empty set of attribute names. Members
of U are denoted by A, B and C etc., possibly subscripted. For each A in U, there is an
associated non-empty set DOM(A). Let $\mathcal{R}$ denote the set of all finite subsets of U. Each
member of $\mathcal{R}$ is a relational schema and is denoted by R, possibly subscripted. Each finite
subset of $\mathcal{R}$ is called a database schema.

In the tour schedule example, suppose U contains TOUR_NO, CITY, ARRIVAL,
DEPARTURE and COST. Then {TOUR_NO, CITY, ARRIVAL, DEPARTURE, COST} is a
member of $\mathcal{R}$, and therefore a relational schema.

For each A in U, Seq(DOM(A)) is the set of finite sequences of elements from DOM(A),
i.e., Seq(DOM(A)) = $\{a_1 \ldots a_n \mid n \ge 1$ and $a_i$ in DOM(A) for each $1 \le i \le n\} \cup \{\varepsilon\}$, where
$\varepsilon$ is the empty sequence. A tuple t of $R = \{A_1, \ldots, A_m\}$ in $\mathcal{R}$ is a total function from R to
$\bigcup_{i=1}^{m}$ Seq(DOM($A_i$)) such that $t(A_i)$ is in Seq(DOM($A_i$)) for each $1 \le i \le m$.

A tuple is usually presented in table form. In the tour schedule example, San Francisco
and Denver are in DOM(CITY). Therefore, the sequence “San Francisco, Denver” is
in Seq(DOM(CITY)). Similarly, 556 is in DOM(TOUR_NO) and the sequence “556” is in
Seq(DOM(TOUR_NO)). Hence, we have the following tuple:


TOUR_NO   CITY            ARRIVAL   DEPARTURE   COST
556       San Francisco   3/21/90   3/23/90     699
          Denver          3/23/90   3/25/90
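
The tuple above can be written down as a mapping from attribute names to sequences. The following Python sketch is only an illustration: the domains in `dom` are hypothetical finite stand-ins for DOM(A), dates are kept as strings, and `is_tuple_over` is an assumed helper name that checks the "total function" condition of the definition.

```python
def is_tuple_over(schema, dom, t):
    """t must assign to every attribute of the schema (and to nothing else)
    a finite sequence, possibly empty, of values from that attribute's domain."""
    return set(t) == set(schema) and all(
        isinstance(t[a], tuple) and all(v in dom[a] for v in t[a])
        for a in schema
    )

schema = ("TOUR_NO", "CITY", "ARRIVAL", "DEPARTURE", "COST")

# Hypothetical finite domains, just large enough for the example.
dom = {
    "TOUR_NO": {356, 456, 556},
    "CITY": {"San Francisco", "Denver", "Atlanta", "Los Angeles"},
    "ARRIVAL": {"3/21/90", "3/23/90"},
    "DEPARTURE": {"3/23/90", "3/25/90"},
    "COST": {699},
}

tour_556 = {
    "TOUR_NO": (556,),
    "CITY": ("San Francisco", "Denver"),
    "ARRIVAL": ("3/21/90", "3/23/90"),
    "DEPARTURE": ("3/23/90", "3/25/90"),
    "COST": (699,),
}
```

An entry may also be the empty sequence, which the check accepts because every attribute still receives a value.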

An instance of a relational schema R is a finite set of tuples over R. A database
instance I of a database schema $\{R_1, \ldots, R_k\}$ is a sequence $I_{R_1}, \ldots, I_{R_k}$ such that $I_{R_i}$ is
an instance of $R_i$ for each $1 \le i \le k$. In the tour schedule example, the table in Figure 1 is
an instance of the relational schema {TOUR_NO, CITY, ARRIVAL, DEPARTURE, COST}.

The model presented above is a proper extension of the relational model [8]. Further
extensions may be necessary for practical purposes. For example, instead of using
sequences of only basic values, we may allow sequences of tuples as elements of another
tuple. However, for simplicity, we only consider the simple model presented.

5  SSQL:  an  Extension  of  SQL 

In this section, we propose an extension of SQL, SSQL or Structured Sequence Query
Language, as a query language over the data model defined in the last section. The
query language is essentially SQL but with sequence operations appearing in the Select
and Where clauses of original SQL queries. Because of the length restriction, we only
consider the select statement. Also, we will be quite informal in presenting SSQL. In the
following, a sequence operation F is called single valued if $F(U_1, \ldots, U_n)$ contains only a
single sequence whenever $U_i$ is a set of one sequence for all $1 \le i \le n$.

We  begin  with  an  example  to  show  the  basic  features  of  SSQL.  The  syntax  will  be 
defined  later.  Consider  the  query  “Display  the  tour  numbers  and  the  second  cities  of  all 
tours  visiting  only  the  cities  in  Tour  356  and  in  the  same  order.”  In  SSQL,  the  query  can 
be  expressed  as  follows. 

(1) Select Tour_no, City[2]
(2) From Tour_schedule
(3) Where City in $[\![(x_1 \cup x_2)^*]\!]_2^{-1}$(
(4)     Select City
(5)     From Tour_schedule
(6)     Where Tour_no = 356 )

Lines (4)-(6) form an ordinary SQL query and return the sequence of cities of tour 356.
Line (3) is another form of $u \in [\![(x_1 \cup x_2)^*]\!]_2^{-1}(w)$ where u is the list of the cities in question
and w is the list of the cities returned from lines (4)-(6). Therefore, line (3) tests if the
cities in question form a subsequence of the cities of tour 356. Finally, line (1) returns the
tour number which satisfies the test of line (3) and the second city of the tour. Note that
City[2] is a shorthand for $[\![x_1x_2x_1^*]\!]_2^{-2}$(CITY).

Notice that in the above example, we used sequence operations in a set membership
test and in the select clause. In general, an SSQL select statement is of the form:

Select $F_1(R_{11}.A_{r_1}, \ldots, R_{1r_1}.A_{r_1}), \ldots, F_l(R_{l1}.A_{r_l}, \ldots, R_{lr_l}.A_{r_l})$
From $R_1, \ldots, R_k$
Where $\Psi$

where $F_i$ is an $r_i$-ary sequence operation for each $1 \le i \le l$ (notice that the attribute
names in a specific $F_i$ are all the same, and if $F_i$ is the unary operation $[\![x_1^*]\!]_1$, then it is
usually omitted), and $\Psi$ is a formula involving the logical connectives and, or and not, arithmetic
comparison operators =, <=, and so on, set comparison operators $\subseteq$ and $\supseteq$ etc., and the
membership test in. When the arithmetic comparison operators are used, each element
of the sequence on the left hand side is compared to each element of the sequence on
the right hand side, and the result of the arithmetic comparison is true if and only if all
these comparisons are true. A (nested) select statement can appear on either side of a
set comparison and on the right side of in. Single valued sequence operations can be used
on the sequences on either side of all comparisons, while other sequence operations can
only be used on sequences on either side of a set comparison and on the right side of
a membership test. Finally, only one attribute can appear in the select clause of a nested
statement.
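
The rule for arithmetic comparisons between sequences (every element on the left against every element on the right) can be sketched as follows. The helper name `seq_compare` is hypothetical, and treating a comparison involving an empty sequence as vacuously true is an assumption of this sketch rather than something stated in the text.

```python
import operator
from datetime import date

def seq_compare(left, right, op):
    """True iff op(a, b) holds for every a in left and every b in right."""
    return all(op(a, b) for a in left for b in right)

# Hypothetical data: departure dates of one tour, first arrival date of another.
departures = (date(1990, 3, 23), date(1990, 3, 25))
first_arrival = (date(1990, 3, 25),)
```

With this data, `seq_compare(departures, first_arrival, operator.le)` holds because every departure date is on or before the single arrival date.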

Like SQL queries, SSQL queries can be interpreted in tuple calculus form. The general
form of an SSQL query can be translated into the following:

$\{s \mid (\exists t_1, \ldots, t_k)(R_1(t_1) \wedge \ldots \wedge R_k(t_k) \wedge \Psi \wedge s.A_{r_1} \in F_1(t_{11}.A_{r_1}, \ldots, t_{1r_1}.A_{r_1}) \wedge \ldots
\wedge s.A_{r_l} \in F_l(t_{l1}.A_{r_l}, \ldots, t_{lr_l}.A_{r_l}))\}$.

The formal definition and detailed discussion of the tuple calculus over the data model
presented in Section 4 are omitted from this paper. However, we can see that the semantics
of SSQL is very similar to that of SQL.

We will not go into a detailed discussion of SSQL here. Instead, we give some example
queries written in SSQL. The examples are based on the tour schedule example in Section 2
(see Figure 1).

Example “List all pairs of tours such that the time periods of the two tours do not
overlap.”

Select t1.TOUR_NO, t2.TOUR_NO
From Tour_schedule t1, Tour_schedule t2
Where $[\![x_1^*x_2]\!]_2^{-2}$(t1.DEPARTURE) <= t2.ARRIVAL[1]

Following the convention of SQL, t1 and t2 are aliases for the relation Tour_schedule.

Example  “Find  all  pairs  of  tours  such  that  the  first  city  of  the  second  tour  is  within  the 
first  tour.” 

Select t1.TOUR_NO, t2.TOUR_NO
From Tour_schedule t1, Tour_schedule t2
Where  f2.CiTY[l]  in  fxjx2xj|2  2(fi.ClTY) 




This  example  shows  that  testing  whether  an  element  is  in  a  sequence  can  be  expressed 
easily. 

Example  “Find  out  all  tours  in  which  Los  Angeles  and  San  Francisco  are  visited  con¬ 
secutively.” 


Select  Tour_no 
From  Tour_schedule 

Where  San  Francisco  in  \xlx2x3x\,x2  =  Los  Angeles]]^ 3(City)  or 
Los  Angeles  in  x2  =  San  Francisco]^ City) 

Example  “List  all  tours  which  visit  at  least  two  more  cities  between  Los  Angeles  and 
San  Francisco.” 

Select  Tour_no 
From  Tour_Schedule 

Where  ClTY  in  lxlx2x3x2  xAxl,x2  =  Los  Angeles, 

X4  =  San  Francisco]^  ^1,2’3,4^(CiTY) 

Example  “List  all  tours  which  visit  an  even  number  of  cities.” 

Select  Tour_no 

From  Tour_Schedule 

Where  ClTY  in  t(xix2)*l2 ^’^(City) 

Example  “List  all  tours  which  visit  the  same  number  of  cities  as  Tour  456.” 

Select  Tour_no 
From  Tour_Schedule 

Where  ClTY  in  [(x^J’J^CKx^J'MClTY, 

Select  City 

From Tour_Schedule

Where  Tour_NO  =  456)) 
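
For readers who want to experiment with these queries outside SSQL, the last few examples can be restated directly over ordinary lists. The following is a minimal sketch in Python, not an implementation of SSQL or of the sequence operations: the table contents, the dictionary representation, and all helper names are invented for illustration, and only the English statements of the queries are taken from the text.

```python
# Hypothetical tour table: TOUR_NO -> sequence of cities (contents invented).
TOURS = {
    123: ["Los Angeles", "San Francisco", "Seattle"],
    456: ["Boston", "New York"],
    789: ["San Francisco", "Reno", "Los Angeles", "Phoenix"],
}


def visited_consecutively(cities, a, b):
    """True if a and b are adjacent in the sequence, in either order."""
    return any({x, y} == {a, b} for x, y in zip(cities, cities[1:]))


def tours_with_consecutive(tours, a, b):
    """Tours in which cities a and b are visited consecutively."""
    return sorted(no for no, cities in tours.items()
                  if visited_consecutively(cities, a, b))


def tours_with_even_length(tours):
    """Tours which visit an even number of cities."""
    return sorted(no for no, cities in tours.items() if len(cities) % 2 == 0)


def tours_same_length_as(tours, ref_no):
    """Tours which visit the same number of cities as tour ref_no."""
    n = len(tours[ref_no])
    return sorted(no for no, cities in tours.items() if len(cities) == n)
```

Each helper scans a whole sequence procedurally; the SSQL versions above state the same conditions declaratively through a pattern on the sequence.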


6  Related  Research  and  Conclusion 


One  of  the  papers  in  the  literature  devoted  entirely  to  the  sequences  in  databases  is  [12]. 
Indeed,  the  only  data  constructor  in  the  data  model  of  [12]  is  the  sequence.  An  algebraic 
query  language  is  proposed  on  the  nested  sequences  in  the  data  model.  However,  the 
selection of operations is ad hoc in nature. There is no single formal mathematical system
behind  the  operations.  Nevertheless,  [12]  introduces  many  powerful  operations  which 
cannot  be  simulated  by  the  sequence  operations  of  this  paper.  It  may  be  interesting  to 
see  how  to  extend  the  sequence  operations  of  this  paper  to  include  them. 

The Tangram stream query processing system [16, 17] is an attempt to use streams
(or  sequences)  as  a  common  processing  model  both  in  AI  systems  and  database  systems. 
The  relations  in  a  relational  database  are  viewed  as  sequences  of  tuples  rather  than  sets 
of tuples. Query processing then becomes sequence processing. Log(F), a powerful combination
of logic and functional programming languages, is used as a base system in stream
(sequence) processing. Since Log(F) is a general programming language, all computable
transformations  of  sequences,  including  the  highly  intractable  ones,  are  possible.  However, 
it  is  not  clear  how  tuples  and  sets  can  retain  their  own  characteristics  in  Log(F). 

Our  interest  in  the  queries  on  sequences  stems  from  the  study  of  a  group  of  context- 
related  interval  queries  [11].  However,  the  focus  of  [11]  is  on  the  preservation  of  the 
computation-tuple  sequence  schemes.  It  is  worth  noting  that  all  queries  in  [11]  can  basi¬ 
cally  be  simulated  by  the  sequence  operations  defined  in  this  paper. 

The  contribution  of  this  paper  is  the  introduction  of  the  first  family  of  declarative 
sequence  operations  based  on  a  theoretically  sound  formalism,  namely,  regular  expressions. 
It  is  shown,  through  examples,  that  sequence  operations  are  quite  natural  as  well  as 
powerful. 

Using  sequence  operations  in  query  languages  is  straightforward.  We  illustrate  this  in 
a  simple  extension  of  SQL.  Actually,  the  way  of  using  the  sequence  operations  in  SSQL 
is  rather  general.  In  the  same  way,  we  can  extend  various  query  languages  on  complex 
objects  to  include  these  sequence  operations  to  handle  sequences. 

Several  important  questions  remain  to  be  answered.  One  of  them  is  the  completeness 
of  these  operations.  In  what  sense  is  a  set  of  sequence  operations  complete?  Obviously, 
there  are  still  many  conceivable  operations  not  representable  in  the  family  introduced  in 
this  paper.  Hence,  another  question  is  how  to  extend  the  family. 
Abstract

This  report  looks  at  two  questions  about  the  Spreadsheet  History 
Model.  First,  several  operations  on  spreadsheet  histories  (selection,  projection, 
cohesion,  intersection,  and  union)  which  are  analogues  of  some  common  rela¬ 
tional-database  operations  are  examined.  The  primary  question  of  concern  is 
whether  or  not  their  results  are  SHS  representable.  Necessary  and  sufficient 
conditions  are  presented  for  those  operators  which  do  not  always  preserve  the 
SHS  formalism.  The  second  question  concerns  a  notion  of  spreadsheet-history 
equivalence  based  on  the  computation-tuple-sequence  projection  operation 
( projection  simulation).  Three  sufficiency  conditions  are  given  for  determining 
when  projection  simulation  preserves  the  history-bounded  SHS. 


* This report summarizes the results concerning the spreadsheet-history model which were obtained
under funding from the Air Force Office of Scientific Research (AFOSR) grant 89-0244. A more detailed
presentation will appear in [X 9 .1.


1  Introduction 


One  of  the  most  widely  used  types  of  small-business  data-processing  software  is  the 
spreadsheet  program  [DLL  88;  Go  87;  WS  86].  The  broad  appeal  of  the  spreadsheet  is  due 
to  its  straightforward  tabular  method  for  describing  computational  relationships  between 
data.  Spreadsheet  programs,  such  as  C-Calc  [DSD],  Excel  [Ms  89],  and  Lotus  1-2-3  [Lot 
85], have automated the design and use of spreadsheets. While there have been discussions
on spreadsheet-programming methodologies [Be 86; RPL 89], little of a theoretical
nature is known about spreadsheets themselves. The purpose of our research is to rigorously
examine a data model for describing spreadsheet histories and their properties.

The  present  report  consists  of  four  sections,  including  this  introduction.  Section  2 
gives  the  basic  definitions  of  the  spreadsheet-history  model.  The  model  describes  the  use  of 
spreadsheets to represent historical, accounting-like data. It consists of sequences of
computation tuples defined by a spreadsheet-history scheme (SHS).

Section 3 examines several operations on spreadsheet histories (selection, projection,
cohesion [GTa 89], intersection, and union) which are analogues of some common
relational-database operations [Co 70; Ul 82; Ma 83]. The primary question of concern here
is whether or not their results are SHS representable. The answer is yes for selection,
cohesion, and intersection, and no for the others. A necessary and sufficient condition is
given for projection and union to characterize when each preserves the SHS model. Some
additional characteristics of selection, projection and cohesion are also presented.

Section 4 concerns a notion of spreadsheet-history equivalence based on the
computation-tuple-sequence projection operation. The concept of projection simulation is
presented and three sufficiency conditions are given for determining when projection simulation
preserves the history-bounded SHS.


2  Spreadsheet  Histories 


In  this  section  a  formal  model  for  spreadsheets  and  their  histories  is  introduced. 

In  simple  terms,  a  spreadsheet  is  a  finite  set  of  related  data.  Each  datum  occupies  a 
unique  location  and  is  either  specified  directly,  using  a  constant,  or  indirectly,  using  a 
function.  Each  datum  location  is  called  a  cell.  In  Microsoft  Excel,  the  cells  are  arranged  in 
a  16,384  row  by  256  column  rectangle,  and  are  addressed  by  row  and  column  indices.  In 
principle,  the  number  of  rows  and  columns  in  a  spreadsheet  may  be  arbitrarily  large.  The 
functions  are  written  in  terms  of  the  data  locations  in  the  spreadsheet. 

EXAMPLE 1.1: Consider a spreadsheet representation of a stock-purchase history.
Figure 1.1 shows the spreadsheet as it might appear using the Microsoft Excel program.
Row 1 contains text that indicates the meaning of the data in each column. Each of the
rows 2 through 12 contains a single stock transaction. For each transaction:

(1)  Information  is  recorded  for: 

DATE  The  date  on  which  the  transaction  occurred. 

TRANS  The transaction type, either BUY, SELL, or DIV(IDEND). For simplicity,
the dividend transaction only records dividends which are disbursed as
shares of stock.

SHARES  The  number  of  shares  of  stock  involved  in  the  transaction.  Again,  for 
simplicity,  only  whole  shares  of  stock  may  be  entered. 

PSV  The  per-share  value  of  the  stock  for  the  current  transaction  (i.e.  the  buy 
or  sell  price). 

(2)  And  values  are  calculated  for: 

VALUE  The  total  dollar  value  of  the  transaction. 

PROFIT  The profit earned for the transaction. (A negative value indicates a loss.)
If TRANS is BUY, then the profit is zero. If TRANS is DIV, then the profit
is equal to the VALUE. If TRANS is SELL, then the profit requires a
more complicated calculation. According to tax laws, when a share of
stock is sold, the profit is equal to the price received minus the original
price paid. When selling stock you are required to sell in a first-in first-out
(FIFO) order.

CUMSH  The cumulative number of shares of stock owned at the completion of the
current transaction.

 1   DATE       TRANS   SHARES   PSV      VALUE        PROFIT       CUMSH
 2   7/2/86     BUY     1000     $6.00    $6,000.00    $0.00        1000
 3   8/15/86    BUY     2000     $5.50    $11,000.00   $0.00        3000
 4   9/10/86    BUY     3000     $5.00    $15,000.00   $0.00        6000
 5   12/31/86   DIV     50       $6.00    $300.00      $300.00      6050
 6   1/20/87    BUY     4000     $3.00    $12,000.00   $0.00        10050
 7   6/30/87    DIV     40       $5.00    $200.00      $200.00      10090
 8   12/31/87   DIV     50       $5.00    $250.00      $250.00      10140
 9   6/30/88    DIV     60       $8.00    $480.00      $480.00      10200
10   8/13/88    SELL    5000     $10.00   $50,000.00   $23,000.00   5200
11   12/31/88   DIV     30       $8.00    $240.00      $240.00      5230
12   2/22/89    SELL    3000     $11.00   $33,000.00   $21,850.00   2230

Figure 1.1

The  data  in  cells  A2,  B2,  C2,  and  D2  are  entered  as  input.  The  numbers  in  cells  E2 
and  F2  are  calculated  using  the  formulas  (written  in  the  notation  employed  by  Microsoft 
Excel)  =D2*C2,  and  =SMacs!TProfit(ROW()),  respectively  and  in  the  specified  order.  [The 
formula  ROW()  returns  the  row  number  of  the  current  cell  and  SMacs!TProfit()  invokes 
the  TProfit  macro  stored  in  the  file  named  “SMacs.”  The  TProfit  macro  is  a  user-written 
function  that  calculates  the  profit  (using  the  FIFO  formula  required  by  the  tax  laws)  for 
the  transaction  recorded  on  the  row  number  passed  to  it.]  The  value  in  G2  is  also  entered 
as  input,  but  because  of  the  nature  of  the  application,  the  value  in  G2  must  agree  with  the 
value  in  C2. 

The data in line 3 represent the second event in the stock-purchase history. The
information in cells A3 through D3 is specified directly. The data in cells E3, F3 and G3
are calculated by the formulas =D3*C3, =SMacs!TProfit(ROW()), and
=IF(B3="SELL",G2-C3,G2+C3) respectively, and in the specified order. [The formula
=IF(cond,expr1,expr2) evaluates the logical expression cond and returns the value of expr1
if cond is true and the value of expr2 if cond is false.] Note that the formula in cell E3
is a relativized version of the one found in cell E2. Most spreadsheet programs provide a
command to copy the formulas from one column to another in this relativized fashion; see
the “copy” and related commands in [DSD; Lot 85; Ms 89].

For the remaining lines, the data in columns A, B, C and D are input as constants,
the numbers in columns E and G are calculated using relativized versions of the formulas
in E3 and G3 respectively, and the values in column F are calculated using the same
formula as found in cell F3. The calculations are performed in row-number order. □
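
The three calculated columns can be reproduced outside a spreadsheet. The following Python sketch is one possible reading of the example, not the TProfit macro itself (whose code is not given): the function name `evaluate` and the lot queue are illustrative, and it assumes that shares received as dividends enter the FIFO queue with their PSV as cost basis. Under that assumption the sketch reproduces the VALUE, PROFIT, and CUMSH columns of Figure 1.1. For simplicity it also computes CUMSH for the first row, which the spreadsheet takes as input.

```python
from collections import deque

# Input columns (DATE, TRANS, SHARES, PSV) for rows 2-12 of Figure 1.1.
ROWS = [
    ("7/2/86", "BUY", 1000, 6.00),
    ("8/15/86", "BUY", 2000, 5.50),
    ("9/10/86", "BUY", 3000, 5.00),
    ("12/31/86", "DIV", 50, 6.00),
    ("1/20/87", "BUY", 4000, 3.00),
    ("6/30/87", "DIV", 40, 5.00),
    ("12/31/87", "DIV", 50, 5.00),
    ("6/30/88", "DIV", 60, 8.00),
    ("8/13/88", "SELL", 5000, 10.00),
    ("12/31/88", "DIV", 30, 8.00),
    ("2/22/89", "SELL", 3000, 11.00),
]


def evaluate(rows):
    """Return (VALUE, PROFIT, CUMSH) for each row, computed in row order."""
    lots = deque()  # unsold lots as [shares, per-share cost], oldest first
    cumsh = 0
    out = []
    for _date, trans, shares, psv in rows:
        value = shares * psv  # column E: =D*C
        if trans == "SELL":
            if shares > cumsh:
                raise ValueError("cannot sell more shares than are owned")
            cost, remaining = 0.0, shares
            while remaining > 0:  # consume the oldest lots first (FIFO)
                lot = lots[0]
                take = min(lot[0], remaining)
                cost += take * lot[1]
                lot[0] -= take
                remaining -= take
                if lot[0] == 0:
                    lots.popleft()
            profit = value - cost
            cumsh -= shares
        else:
            # Assumption: BUY and DIV shares both become lots priced at PSV.
            lots.append([shares, psv])
            profit = value if trans == "DIV" else 0.0
            cumsh += shares
        out.append((value, profit, cumsh))
    return out


if __name__ == "__main__":
    for row, result in zip(ROWS, evaluate(ROWS)):
        print(row[0], row[1], result)
```

For the sale on 8/13/88 the oldest lots cost 1000 x $6.00 + 2000 x $5.50 + 2000 x $5.00 = $27,000, so the profit is $50,000 - $27,000 = $23,000, in agreement with row 10 of the figure.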

Each line (except 1) of the spreadsheet in Figure 1.1 represents a single event in the
stock-purchase history. In use, a new line is added to the spreadsheet each time some
stock is received or sold. The history is modeled by a sequence of events or transactions.
This type of historical data modeling also occurs in many other spreadsheet applications,
for example the checkbook-management spreadsheets in [CA; DSD], the sales, cash-flow,
and budget-forecasting spreadsheets in [CA], and the “what-if” models in [Jo 89].

In  this  simplified  example,  each  stock  transaction  is  kept  on  a  single  line  of  the 
spreadsheet.  For  a  more  detailed  accounting,  it  may  be  preferable  to  display  each  event  on 
a  single  spreadsheet.  Such  details  of  data  display  are  important  human-factors  concerns, 
but  are  irrelevant  for  the  analysis  carried  out  in  this  manuscript. 

Before proceeding to the formal model, some preliminary definitions are in order. It is
assumed that there exists an infinite set of domain values (denoted $\mathrm{Dom}_\infty$) and an infinite
set of attributes (denoted $U_\infty$). The set $U_\infty$ is partitioned into two infinite disjoint sets, $I_\infty$
and $E_\infty$, respectively called input attributes and evaluation attributes. For each A in $U_\infty$,
Dom(A) is a subset of $\mathrm{Dom}_\infty$ of at least two elements. All attributes are assumed to be
elements of $U_\infty$. The symbols A, B, and C (possibly subscripted or primed) will denote
attributes and U, V, and W (possibly subscripted or primed) will denote nonempty, finite
sets of attributes. As is customary in the relational database literature, the union of sets of
attributes will be denoted by juxtaposition. So, UV shall mean $U \cup V$. And, committing a
slight abuse of notation, AB and UA shall respectively mean $\{A\} \cup \{B\}$ and $U \cup \{A\}$.

There is a total order $<_\infty$ over $U_\infty$ such that (i) $A <_\infty B$ for each A in $I_\infty$ and B in $E_\infty$;
and (ii) for each B in $E_\infty$, there exists a B' in $E_\infty$ such that $B <_\infty B'$. Let X be a finite
nonempty subset of $U_\infty$ and $A_1, \ldots, A_n$ the listing of the elements of X according to $<_\infty$. Then
<X> will denote the sequence $A_1, \ldots, A_n$ and Dom(<X>) the Cartesian product
$\mathrm{Dom}(A_1) \times \cdots \times \mathrm{Dom}(A_n)$. For $i \ge 2$, $\langle X \mid A_i\rangle$ denotes the prefix $A_1, \ldots, A_{i-1}$. [A prefix of a
sequence $p_1 \ldots p_m$ is a subsequence of the form $p_1 \ldots p_i$ for some i, $1 \le i \le m$.]

An important aspect of Example 1.1 is the regularity of form exhibited by each line of
data in Figure 1.1. To capture this detail in the formal model, each line will be considered
as constituting a separate spreadsheet and each spreadsheet will be represented by a single
computation tuple defined over a finite set of attributes. The cells of a spreadsheet
which  are  specified  by  constants  will  be  modeled  by  input  attributes  while  those  cells 
which  have  values  determined  by  functions  will  be  represented  by  evaluation  attributes. 
The  entire  sequence  of  spreadsheets  will  be  represented  by  a  computation-tuple  sequence. 

Segmenting the attributes into inputs and evaluations reflects the different roles
played by the cells in the spreadsheet. This partitioning will be specified by an attribute
scheme. An attribute scheme over <U> is an ordered pair (<I>, <E>), where <U> = <I><E>,
$I = I_\infty \cap U \ne \emptyset$ and $E = E_\infty \cap U \ne \emptyset$. [Given sequences of attributes $\langle U_1\rangle = A_1, \ldots, A_m$ and
$\langle U_2\rangle = B_1, \ldots, B_n$, $\langle U_1\rangle\langle U_2\rangle$ will denote the concatenation of the sequences, i.e., $A_1, \ldots,
A_m, B_1, \ldots, B_n$.] That is, an attribute scheme divides <U> into a sequence of input
attributes, I, and a sequence of evaluation attributes, E.

A computation tuple over <U> is an element in Dom(<U>). A computation-tuple
sequence over <U> is a finite, nonempty sequence of computation tuples over <U>. The set
of all computation-tuple sequences over <U> is denoted by SEQ(<U>). For each <U> and
$p \ge 1$, $\mathrm{SEQ}(\langle U\rangle, p) = \{\bar{u} \text{ in } \mathrm{SEQ}(\langle U\rangle) \mid |\bar{u}| \ge p\}$. [The length of a computation-tuple
sequence $\bar{u}$ is denoted $|\bar{u}|$.]

The symbol $\Lambda$ will denote the empty sequence, that is, the sequence which contains no
tuples. For each <U>, $\mathrm{SEQ}(\langle U\rangle, 0) = \{\Lambda\} \cup \mathrm{SEQ}(\langle U\rangle, 1)$.

Unless otherwise stated, $u$, $v$, and $w$ (possibly subscripted or primed) represent
computation tuples. Similarly, $\bar{u}$, $\bar{v}$, and $\bar{w}$ represent computation-tuple sequences. And $\hat{u}$, $\hat{v}$,
and $\hat{w}$ represent either computation-tuple sequences or the empty sequence. The
catenation of sequences and tuples will be denoted by juxtaposition.


Let $u$ be a computation tuple over <U> and A an attribute in U. The value of $u$ on A
will be denoted by $u(A)$. If <V> is a subsequence of <U>, then $u[\langle V\rangle]$ and $\pi_V(u)$ both
denote the tuple $v$ over <V> with $v(A) = u(A)$ for each A in V. The tuple $v$ is called the
projection of $u$ onto <V>. For each $\bar{u} = u_1 \ldots u_n$ in SEQ(<U>), let $\pi_V(\bar{u}) = \pi_V(u_1) \ldots \pi_V(u_n)$. For
the empty sequence $\Lambda$, we define $\pi_V(\Lambda)$ to be $\Lambda$. For each$^1$ $\mathcal{U} \subseteq \mathrm{SEQ}(\langle U\rangle)$, let
$\pi_V(\mathcal{U}) = \{\pi_V(\bar{u}) \mid \bar{u} \text{ in } \mathcal{U}\}$. Note that $\pi_V$ is the computation-tuple sequence analogue of the
relational database projection operator [Co 70].
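
The three levels of projection (tuple, sequence, set of sequences) can be illustrated with a small Python sketch. The representation is an assumption made only for illustration: a computation tuple is a dictionary from attribute names to values, a sequence is a list of such dictionaries, and the function names are invented.

```python
def project_tuple(u, V):
    """Projection of tuple u (dict: attribute -> value) onto the attributes V."""
    return {A: u[A] for A in V}


def project_sequence(seq, V):
    """Projection applied tuple by tuple; the empty sequence maps to itself."""
    return [project_tuple(u, V) for u in seq]


def _freeze(seq):
    """Hashable form of a sequence of dict tuples, used to detect duplicates."""
    return tuple(tuple(sorted(u.items())) for u in seq)


def project_set(seqs, V):
    """Projection of a collection of sequences, with duplicate images merged."""
    seen, out = set(), []
    for seq in seqs:
        image = project_sequence(seq, V)
        key = _freeze(image)
        if key not in seen:
            seen.add(key)
            out.append(image)
    return out
```

Distinct sequences may have the same image, which is why `project_set` merges duplicates; this loss of information is what later makes projection problematic for SHS representability.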

The  first  component  of  the  spreadsheet-history  model  is  the  spreadsheet  scheme. 

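DEFINITION: A spreadsheet scheme over <U> is a triple $\mathcal{S} = (\langle I\rangle, \langle E\rangle, S)$, where

• $(\langle I\rangle, \langle E\rangle)$ is an attribute scheme over <U>; and

• $S = \{s_C \mid C \text{ in } E\}$, where each $s_C$ is a partial recursive function from
$\mathrm{SEQ}(\langle U\rangle, p(s_C)) \times \mathrm{Dom}(\langle U \mid C\rangle)$ to $\mathrm{Dom}(C)$, with $p(s_C) \ge 0$. □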

The functions in $S$ are called spreadsheet functions. The number $p(s_C)$ is called the
rank of $s_C$ and $p(\mathcal{S}) = \max\{p(s_C) \mid C \text{ in } E\}$ the rank of $\mathcal{S}$.

The  rank  determines  the  number  of  computation  tuples  which  must  exist  in  a 
sequence  before  the  spreadsheet  function  can  be  applied. 

The purpose of a spreadsheet scheme is to define a set of “valid” spreadsheet
sequences, that is, sequences which are consistent with the functions in the scheme.

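DEFINITION: Let $\mathcal{S} = (\langle I\rangle, \langle E\rangle, S)$ be a spreadsheet scheme over <U>. For each C
in E, denote the set of sequences valid with respect to $s_C$ by$^2$

$$\mathrm{VSEQ}(s_C) = \{u_1 \ldots u_n \mid u_i(C) = s_C(u_1 \ldots u_{i-1},\ u_i[\langle U \mid C\rangle]) \text{ for all } p(s_C) < i \le n\};$$

and for each $E'$, $\emptyset \ne E' \subseteq E$, let $\mathrm{VSEQ}(E') = \bigcap_{C \text{ in } E'} \mathrm{VSEQ}(s_C)$. Let $\mathrm{VSEQ}(\mathcal{S}) = \mathrm{VSEQ}(E)$. □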

Given a spreadsheet function $s_C$, every sequence in SEQ(<U>) of length at most
$p(s_C)$ is in $\mathrm{VSEQ}(s_C)$.
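
The validity condition can be checked mechanically for a given finite sequence. The sketch below is illustrative only and rests on several assumptions: tuples are dictionaries, `order` is the attribute sequence <U> as a list, and the example scheme (attributes AMOUNT and BALANCE with a rank-1 running-balance function) is invented rather than taken from the report.

```python
def is_valid(seq, attr, func, rank, order):
    """Check u_i(C) == s_C(u_1..u_{i-1}, u_i[<U|C>]) for all rank < i <= n.

    seq   -- list of dict tuples u_1..u_n
    attr  -- the evaluation attribute C
    func  -- the spreadsheet function s_C(history, partial_tuple)
    rank  -- the rank of s_C: tuples at positions 1..rank are not checked
    order -- the attribute sequence <U>, used to form the prefix <U|C>
    """
    prefix_attrs = order[:order.index(attr)]  # the attributes of <U|C>
    for i in range(rank, len(seq)):  # 0-based index i is the tuple u_{i+1}
        history = seq[:i]
        partial = {A: seq[i][A] for A in prefix_attrs}
        if seq[i][attr] != func(history, partial):
            return False
    return True


# Hypothetical rank-1 function: new balance = previous balance + amount.
def s_balance(history, partial):
    return history[-1]["BALANCE"] + partial["AMOUNT"]


ORDER = ["AMOUNT", "BALANCE"]
GOOD = [{"AMOUNT": 5, "BALANCE": 5}, {"AMOUNT": 3, "BALANCE": 8}]
BAD = [{"AMOUNT": 5, "BALANCE": 5}, {"AMOUNT": 3, "BALANCE": 9}]
```

A sequence no longer than the rank passes vacuously, which matches the remark above: `is_valid(GOOD[:1], "BALANCE", s_balance, 1, ORDER)` holds whatever the first BALANCE is.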

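DEFINITION: For each $\mathcal{U} \subseteq \mathrm{SEQ}(\langle U\rangle)$ and positive integer k, let $\mathrm{prefix}(\mathcal{U}) = \{\bar{u}' \mid \bar{u}'$
is a prefix of some $\bar{u}$ in $\mathcal{U}\}$ and $\mathrm{prefix}_k(\mathcal{U}) = \{\bar{u}' \text{ in } \mathrm{prefix}(\mathcal{U}) \mid |\bar{u}'| \le k\}$. If $\mathcal{U} = \mathrm{prefix}(\mathcal{U})$
then $\mathcal{U}$ is said to be prefix closed.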
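
As a concrete illustration, the prefix operations can be written for finite sets of sequences. This is a minimal sketch in which a sequence is represented as a Python tuple and the function names simply mirror the definition; the representation is an assumption made for illustration.

```python
def prefixes(seq):
    """All nonempty prefixes of a sequence, including the sequence itself."""
    return {seq[:i] for i in range(1, len(seq) + 1)}


def prefix(seqs):
    """Prefix closure of a set of sequences."""
    out = set()
    for s in seqs:
        out |= prefixes(s)
    return out


def prefix_k(seqs, k):
    """The members of the prefix closure of length at most k."""
    return {s for s in prefix(seqs) if len(s) <= k}


def is_prefix_closed(seqs):
    """True if the set already contains every prefix of its members."""
    return set(seqs) == prefix(seqs)
```

For example, `prefix({("a", "b", "c")})` contains the three sequences `("a",)`, `("a", "b")`, and `("a", "b", "c")`, and applying `prefix` a second time adds nothing.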


1. The symbol $\subseteq$ denotes set inclusion while $\subset$ denotes proper set inclusion.


Notice that $\mathrm{VSEQ}(s_C)$ is prefix closed. Thus, $\mathrm{VSEQ}(\mathcal{S})$ is also prefix closed.

To  complete  the  spreadsheet-history  model,  there  must  be  a  mechanism  to  provide 
the  evaluation-attribute  values  at  the  beginning  of  a  computation-tuple  sequence.  This  is 
done  using  a  prefix-closed  set  of  sequences  of  length  at  most  the  rank  of  the  spreadsheet 
scheme. 

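DEFINITION: Given a spreadsheet scheme $\mathcal{S}$ over <U>, an initialization (with respect
to $\mathcal{S}$) is a recursively enumerable, prefix-closed subset $\mathcal{I}$ of $\{\bar{u} \text{ in } \mathrm{VSEQ}(\mathcal{S}) \mid |\bar{u}| \le
\max\{1, p(\mathcal{S})\}\}$. Given an initialization $\mathcal{I}$, let $\mathrm{VSEQ}(\mathcal{I})$ denote the set $\mathcal{I} \cup \{\bar{u} \text{ in } \mathrm{SEQ}(\langle U\rangle) \mid
\bar{u} = \bar{u}_1\bar{u}_2 \text{ for some } \bar{u}_1 \text{ in } \mathcal{I} \text{ of length } p(\mathcal{S})\}$. □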

Clearly, $\mathrm{VSEQ}(\mathcal{I})$ is prefix closed.

A  set  of  spreadsheet  histories  is  defined  by  a  spreadsheet-history  scheme.  Formally: 

DEFINITION: A spreadsheet-history scheme (abbreviated SHS) over <U> is an
ordered pair $H = (\mathcal{S}, \mathcal{I})$, where

• $\mathcal{S}$ is a spreadsheet scheme over <U>; and

• $\mathcal{I}$ is an initialization with respect to $\mathcal{S}$.

Let $p(H)$, called the rank of H, be $\max\{1, p(\mathcal{S})\}$. □

An  SHS  determines  valid  spreadsheet  histories  as  follows: 

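DEFINITION: For each SHS $H = (\mathcal{S}, \mathcal{I})$ let $\mathrm{VSEQ}(H) = \mathrm{VSEQ}(\mathcal{S}) \cap \mathrm{VSEQ}(\mathcal{I})$. A
spreadsheet sequence is said to be valid (for H) if it is in VSEQ(H). □

Since both $\mathrm{VSEQ}(\mathcal{S})$ and $\mathrm{VSEQ}(\mathcal{I})$ are prefix closed, so is VSEQ(H).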

Example 1.1 (continued): The stock-purchase history can be recast using the
formal model. The labels in line 1 of Figure 1.1 will be ignored since they do not enter into
any of the calculations. Each of the other lines will be represented by spreadsheets
(computation tuples). An SHS over <U> for the income history is $H = ((\langle I\rangle, \langle E\rangle, S), \mathcal{I})$, where

• <I> = <DATE, TRANS, SHARES, PSV>.

• <E> = <VALUE, PROFIT, CUMSH>.

• The domains of the attributes are the obvious ones.


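$$s_{\mathrm{VALUE}}(u_1 \ldots u_n,\ u_{n+1}[\langle U \mid \mathrm{VALUE}\rangle]) = u_{n+1}(\mathrm{PSV}) \times u_{n+1}(\mathrm{SHARES})$$

$$s_{\mathrm{PROFIT}}(u_1 \ldots u_n,\ u_{n+1}[\langle U \mid \mathrm{PROFIT}\rangle]) =
\begin{cases}
f(\pi_I(u_1 \ldots u_n),\ u_{n+1}[\langle I\rangle]) & \text{if } u_{n+1}(\mathrm{TRANS}) = \mathrm{SELL} \text{ and } u_{n+1}(\mathrm{SHARES}) \le u_n(\mathrm{CUMSH}) \\
\text{undefined} & \text{if } u_{n+1}(\mathrm{TRANS}) = \mathrm{SELL} \text{ and } u_{n+1}(\mathrm{SHARES}) > u_n(\mathrm{CUMSH}) \\
u_{n+1}(\mathrm{VALUE}) & \text{if } u_{n+1}(\mathrm{TRANS}) = \mathrm{DIV} \\
0 & \text{if } u_{n+1}(\mathrm{TRANS}) = \mathrm{BUY}
\end{cases}$$

$$s_{\mathrm{CUMSH}}(u_1 \ldots u_n,\ u_{n+1}[\langle U \mid \mathrm{CUMSH}\rangle]) =
\begin{cases}
u_n(\mathrm{CUMSH}) - u_{n+1}(\mathrm{SHARES}) & \text{if } u_{n+1}(\mathrm{TRANS}) = \mathrm{SELL} \text{ and } u_{n+1}(\mathrm{SHARES}) \le u_n(\mathrm{CUMSH}) \\
\text{undefined} & \text{if } u_{n+1}(\mathrm{TRANS}) = \mathrm{SELL} \text{ and } u_{n+1}(\mathrm{SHARES}) > u_n(\mathrm{CUMSH}) \\
u_n(\mathrm{CUMSH}) + u_{n+1}(\mathrm{SHARES}) & \text{otherwise}
\end{cases}$$

Figure 1.2. Function definitions for Example 1.1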


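• $\mathcal{I} = \{u \text{ in } \mathrm{Dom}(\langle U\rangle) \mid u(\mathrm{TRANS}) = \mathrm{BUY},\ u(\mathrm{SHARES}) > 0,\ u(\mathrm{VALUE}) =
u(\mathrm{PSV}) \times u(\mathrm{SHARES}),\ u(\mathrm{PROFIT}) = 0, \text{ and } u(\mathrm{CUMSH}) = u(\mathrm{SHARES})\}$.

• The functions in $S = \{s_{\mathrm{VALUE}}, s_{\mathrm{PROFIT}}, s_{\mathrm{CUMSH}}\}$ are defined for each $u_1 \ldots u_n$ in
SEQ(<U>, 1) and tuple $u_{n+1}$ in Dom(<U>) as shown in Figure 1.2.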

In Figure 1.2, $f(\pi_I(u_1 \ldots u_n),\ u_{n+1}[\langle I\rangle])$ denotes the function $f$ from SEQ(<I>, 0) × Dom(<I>)
into Dom(PROFIT) which returns the profit for the sale of stock using the FIFO method
required by the tax laws. This is a straightforward, albeit detailed, calculation. The
earliest-purchased unsold stock can be found by summing the most recently received
shares until the total equals or exceeds the current number of shares owned. The specifics
are omitted. (Note that $f$ is defined solely over input-attribute values even though a
program for $f$ could make use of some of the VALUE and CUMSH attributes in the
sequence.) □

In  the  spreadsheet-history  model,  a  spreadsheet  function  is  defined  with  respect  to 
a  computation-tuple  sequence.  In  real-world  applications,  a  spreadsheet  function  is  often 
determined  solely  by  a  bounded  number  of  computation  tuples  at  the  end  of  the  sequence. 
Such  history  bounded  functions  represent  those  spreadsheet  functions  which,  in  one  sense, 
can  be  implemented  efficiently.  More  formally: 

DEFINITION: Let $s_C$ be a spreadsheet function in the spreadsheet scheme $\mathcal{S} = (\langle I\rangle,
\langle E\rangle, S)$ over <U> and k a non-negative integer. If $s_C(\hat{u}\hat{v},\ w[\langle U \mid C\rangle]) = s_C(\hat{v},\ w[\langle U \mid C\rangle])$ for
all sequences $\hat{u}\hat{v}w$ in SEQ(<U>), where $|\hat{v}| = k$, then $s_C$ is said to be k-history bounded. A
spreadsheet function is said to be history bounded if it is k-history bounded for some k. If
all spreadsheet functions in $S$ are k-history bounded, then $\mathcal{S}$ is also said to be k-history
bounded. Likewise, $\mathcal{S}$ is said to be history bounded if it is k-history bounded for some k.

If $s_C$ is k-history bounded, then $k \ge p(s_C)$. Suppose $s_C$ is k-history bounded for some
k. It is easily seen that $s_C$ is also m-history bounded for all $m \ge k$.
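
The distinction can be made concrete with two invented functions over hypothetical attributes AMOUNT and BALANCE. The sketch below is only a finite sanity check under these assumptions: comparing a function on a full history and on its last k tuples can refute k-history boundedness on the samples supplied, but it cannot prove boundedness in general.

```python
def running_total(history, partial):
    """Uses only the last tuple of the history, so it is 1-history bounded."""
    return history[-1]["BALANCE"] + partial["AMOUNT"]


def grand_total(history, partial):
    """Reads AMOUNT in every earlier tuple, so it is not history bounded."""
    return sum(u["AMOUNT"] for u in history) + partial["AMOUNT"]


def agrees_on_last_k(func, k, histories, partial):
    """True if func gives the same value on each sample history and on its last k tuples."""
    return all(
        func(h, partial) == func(h[len(h) - k:], partial)
        for h in histories
        if len(h) >= k
    )


SAMPLES = [
    [{"AMOUNT": 1, "BALANCE": 1}],
    [{"AMOUNT": 1, "BALANCE": 1}, {"AMOUNT": 2, "BALANCE": 3}],
    [{"AMOUNT": 1, "BALANCE": 1}, {"AMOUNT": 2, "BALANCE": 3}, {"AMOUNT": 4, "BALANCE": 7}],
]
NEXT = {"AMOUNT": 10}
```

On these samples `running_total` agrees with its restriction to the last tuple, while `grand_total` does not. This parallels the contrast drawn in Example 1.1 between a cumulative count, which needs only the previous tuple, and a calculation that may reach arbitrarily far back in the history.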

DEFINITION: An SHS $H = (\mathcal{S}, \mathcal{I})$ over <U> is said to be k-history bounded (history
bounded) if $\mathcal{S}$ is k-history bounded (history bounded). □

In Example 1.1 the spreadsheet functions $s_{\mathrm{VALUE}}$ and $s_{\mathrm{CUMSH}}$ are both history
bounded, but $s_{\mathrm{PROFIT}}$ is not. Thus, the SHS H is not history bounded.

In the sequel, the primary concern will be to examine the conditions under which
operations on sets of spreadsheet histories preserve the SHS formalism. Consequently,
before proceeding to the technical results, we need one final definition.

DEFINITION: A set $\mathcal{U}$ or an SHS H is said to be (history-bounded-) SHS representable
if there exists a (history-bounded) SHS H' such that $\mathcal{U} = \mathrm{VSEQ}(H')$ or VSEQ(H) =
VSEQ(H') respectively. □


3  Relational  Operators 


This section presents some spreadsheet-history analogues to the relational-database
operators [Co 70; Ul 82; Ma 83].

We first address a selection operator analogue, called the historical-selection
operator. Historical-selection operators are unary functions from $2^{\mathrm{SEQ}(\langle U\rangle)}$ to $2^{\mathrm{SEQ}(\langle U\rangle)}$ which
return prefix-closed sets of histories. Our first result shows that the historical-selection
operator preserves SHS representability. The last result demonstrates the undecidability of
determining when a historical-selection operator maps a set to itself.

DEFINITION: For each computable mapping $\Theta$ from SEQ(<U>) to {true, false}, let $\sigma_\Theta$
be the function defined for each subset $\mathcal{U}$ of SEQ(<U>) by $\sigma_\Theta(\mathcal{U}) = \mathrm{prefix}(\{\bar{u} \text{ in } \mathcal{U} \mid \Theta(\bar{u}) =
\text{true}\})$. The function $\sigma_\Theta$ is called a historical-selection operator. □

In the preceding definition, the mapping $\Theta$ acts as a selection criterion to pick a
subset of the histories in $\mathcal{U}$, namely, those histories $\bar{u}$ for which $\Theta(\bar{u})$ = true. The query $\sigma_\Theta$
takes the prefix closure of the collection chosen by $\Theta$. The query is “historical” in the sense
that it is possible to reach each history in $\sigma_\Theta(\mathcal{U})$ from a length-one history.

The first major result of this section will show that every historical-selection
operator preserves SHS representability, i.e., if $\mathcal{U} \subseteq \mathrm{SEQ}(\langle U\rangle)$ is SHS representable and $\Theta$ is a
computable mapping from SEQ(<U>) to {true, false}, then $\sigma_\Theta(\mathcal{U})$ is SHS representable.

THEOREM 3.1: Let H be an SHS over <U>, and $\Theta$ a computable mapping from
SEQ(<U>) to {true, false}. Then $\sigma_\Theta(\mathrm{VSEQ}(H))$ is SHS representable. □

The proof of Theorem 3.1 uses a spreadsheet function that “performs” the
“selection”. Whether or not $\sigma_\Theta(\mathrm{VSEQ}(H))$ is history-bounded-SHS representable cannot in
general be inferred from the construction of the function.

A historical-selection operator selects a subset of histories using a criterion of
interest. Because $\sigma_\Theta(\mathrm{VSEQ}(H))$ is prefix closed, some care must be taken when formulating a
selection criterion. Consider the following.

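Example 3.1: Let H be the stock-purchase SHS from Example 1.1 and $\Theta(\bar{u})$ = true
if $\bar{u}$ contains a stock purchase of 5000 or more shares. Ostensibly, $\sigma_\Theta(\mathrm{VSEQ}(H))$ should
return only those histories which contain large stock purchases. But because $\sigma_\Theta(\mathrm{VSEQ}(H))$ is
prefix closed, $\sigma_\Theta(\mathrm{VSEQ}(H)) = \mathrm{VSEQ}(H)$. [Consider an arbitrary sequence $u_1 \ldots u_n$ in
VSEQ(H). Let $u_{n+1}$ be the tuple where $u_{n+1}(\mathrm{DATE}) = u_n(\mathrm{DATE})$, $u_{n+1}(\mathrm{TRANS})$ = “BUY,”
$u_{n+1}(\mathrm{SHARES}) = 5000$, $u_{n+1}(\mathrm{PSV})$ = $5.00, $u_{n+1}(\mathrm{VALUE}) = s_{\mathrm{VALUE}}(u_1 \ldots u_n,
u_{n+1}[\langle U \mid \mathrm{VALUE}\rangle])$, $u_{n+1}(\mathrm{PROFIT}) = s_{\mathrm{PROFIT}}(u_1 \ldots u_n, u_{n+1}[\langle U \mid \mathrm{PROFIT}\rangle])$, and
$u_{n+1}(\mathrm{CUMSH}) = s_{\mathrm{CUMSH}}(u_1 \ldots u_n, u_{n+1}[\langle U \mid \mathrm{CUMSH}\rangle])$. Then $u_1 \ldots u_{n+1}$ is in $\sigma_\Theta(\mathrm{VSEQ}(H))$.
Hence, by prefix closure, so is $u_1 \ldots u_n$.] □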
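
The effect of prefix closure in Example 3.1 can be seen on a toy collection. The following Python sketch makes simplifying assumptions that are not in the report: a history is reduced to a tuple of SHARES values of BUY transactions, and the collection of histories is a small finite set chosen for illustration.

```python
def prefix(seqs):
    """Prefix closure of a set of nonempty sequences (tuples)."""
    return {s[:i] for s in seqs for i in range(1, len(s) + 1)}


def historical_select(theta, seqs):
    """Keep the histories satisfying theta, then take the prefix closure."""
    return prefix({s for s in seqs if theta(s)})


def has_large_purchase(history):
    """Selection criterion: some purchase of 5000 or more shares."""
    return any(shares >= 5000 for shares in history)


# A prefix-closed toy collection of histories.
HISTORIES = prefix({(100, 5000), (200, 300)})
SELECTED = historical_select(has_large_purchase, HISTORIES)
```

Here `SELECTED` contains `(100,)` as well as `(100, 5000)`, although `(100,)` contains no large purchase: it is returned because it is a prefix of a selected history. In the full model every valid history can be extended by a large purchase, which is why the selection returns all of VSEQ(H).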

Example 3.1 shows that $\sigma_\Theta(\mathrm{VSEQ}(H))$ may not be a proper subset of VSEQ(H). The
next result shows that it is recursively unsolvable to determine whether or not
$\sigma_\Theta(\mathrm{VSEQ}(H)) = \mathrm{VSEQ}(H)$ for an arbitrary historical selection.

THEOREM 3.2: It is recursively unsolvable to determine for an arbitrary U, an
arbitrary SHS H over <U> and an arbitrary computable mapping $\Theta$ defined from SEQ(<U>) to
{true, false} whether or not $\sigma_\Theta(\mathrm{VSEQ}(H)) = \mathrm{VSEQ}(H)$. □

Some of the properties of projection will now be presented. Clearly, projection
preserves prefix closure. However, it does not in general preserve SHS representability.
Indeed, if H is an SHS over <U> = <I><E>, then neither $\pi_I(\mathrm{VSEQ}(H))$ nor $\pi_E(\mathrm{VSEQ}(H))$ is
SHS representable. Problems may still arise even if the projection operator is restricted to
a subset V of U such that $V \cap I_\infty \ne \emptyset$ and $V \cap E_\infty \ne \emptyset$. It is possible that the loss of
information under the projection mapping could preclude the existence of spreadsheet
functions for the resulting VSEQ.

PROPOSITION 3.3: Let H be an SHS over <U> = <I><E>, <U'> = <I'><E'>, $\emptyset \ne I' \subseteq I$,
and $\emptyset \ne E' \subseteq E$. Then $\pi_{U'}(\mathrm{VSEQ}(H))$ is SHS representable if and only if there exists some
r such that for all n > r and all pairs of sequences $u_1 \ldots u_n$ and $w_1 \ldots w_n$ in VSEQ(H),

(*) $\pi_{U'}(u_1 \ldots u_{n-1}) = \pi_{U'}(w_1 \ldots w_{n-1})$ and $\pi_{I'}(u_n) = \pi_{I'}(w_n)$ imply $u_n[E'] = w_n[E']$. □

The  next  result  demonstrates  that  the  VSEQ  of  every  spreadsheet-history  scheme  is 
the  projected  image  of  a  history-bounded  spreadsheet-history  scheme. 

Theorem 3.4: For each SHS H over <U>, there exists a p(H)-history-bounded SHS
H' such that $\pi_U$ maps VSEQ(H') one-to-one onto VSEQ(H). □


The  proof  of  Theorem  3. 4  encodes  the  entire  previous  history  in  a  single  attribute 
value.  However,  the  complexity  of  encoding  the  historical  information  may  be  greater  than 
the  complexity  of  the  functions  in  the  initial  scheme. 

A  computation-tuple  sequence  analogue  to  the  relational-database  join  operator, 
called  cohesion,  was  defined  in  [GTa  89].  We  now  examine  the  cohesion  of  spreadsheet 
histories.  First  we  shall  show  that  the  cohesion  of  two  SHS-representable  sets  is  SHS 
representable.  Then  we  shall  address  a  problem  concerning  minimum  representations. 

DEFINITION: Given <U> and <V>, the cohesion of u in SEQ(<U>) and v in
SEQ(<V>), denoted u © v, is

1) the computation-tuple sequence w in SEQ(<UV>) such that π_U(w) = u and
π_V(w) = v if π_A(u) = π_A(v) for each A in U ∩ V, and

2) undefined otherwise.

The cohesion of 𝒰 ⊆ SEQ(<U>) and 𝒱 ⊆ SEQ(<V>), denoted 𝒰 © 𝒱, is the set { u © v | u in
𝒰, v in 𝒱 }. □
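
The definition can be sketched in a few lines. This is a minimal sketch under stated assumptions: a computation tuple is modelled as a Python dict, a sequence as a list of dicts, `None` stands for "undefined", and sequences of different lengths are treated as having no cohesion (no w can project onto both). The attribute names DAY, QTY and PRICE are hypothetical.

```python
def cohesion(u, v):
    """u © v: merge two sequences tuple by tuple, or None if they disagree."""
    if len(u) != len(v):
        return None                      # assumption: no common w can exist
    joined = []
    for ru, rv in zip(u, v):
        # The sequences must agree on every shared attribute (U ∩ V).
        if any(ru[a] != rv[a] for a in ru.keys() & rv.keys()):
            return None
        joined.append({**ru, **rv})
    return joined


def cohesion_sets(us, vs):
    """Set-level cohesion: every defined u © v with u in us and v in vs."""
    results = []
    for u in us:
        for v in vs:
            w = cohesion(u, v)
            if w is not None:
                results.append(w)
    return results


u = [{"DAY": 1, "QTY": 5}, {"DAY": 2, "QTY": 3}]
v1 = [{"DAY": 1, "PRICE": 10}, {"DAY": 2, "PRICE": 12}]
v2 = [{"DAY": 1, "PRICE": 10}, {"DAY": 3, "PRICE": 12}]

print(cohesion(u, v2))                  # None
print(len(cohesion_sets([u], [v1, v2])))  # 1
```

The tuple-by-tuple agreement test plays the role of the join condition in a relational natural join.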

Since the attributes are drawn from a universe with a fixed ordering, <UV> = <VU>.
Hence, u © v = v © u for each u in SEQ(<U>) and v in SEQ(<V>).

A question which naturally arises is: Does cohesion preserve SHS representability?
In other words, is VSEQ(H_1) © VSEQ(H_2) SHS representable for all SHS H_1 and H_2? The
answer is yes. To demonstrate this, we need:

DEFINITION: Let H_1 = ((<I_1>, <E_1>, S_1), ℐ_1) and H_2 = ((<I_2>, <E_2>, S_2), ℐ_2) be SHS
over <U> and <V> respectively, and r = max{ρ(H_1), ρ(H_2)}. For i = 1, 2 and each B in E_i, let
s_iB denote the spreadsheet function for B in S_i. The cohesion of H_1 and H_2, denoted
H_1 © H_2, is the SHS ((<I_1I_2>, <E_1E_2>, S), ℐ), where:

•  ℐ = prefix_r(VSEQ(H_1) © VSEQ(H_2)).

•  For each B in (E_1 ∪ E_2) - (E_1 ∩ E_2), s_B is the rank-r spreadsheet function
defined by s_B(w, x) = s_1B(π_U(w), x[<U | B>]) if B is in E_1 and
s_B(w, x) = s_2B(π_V(w), x[<V | B>]) if B is in E_2.

•  For each B in E_1 ∩ E_2, s_B is the rank-r spreadsheet function defined by

   s_B(w, x) = s_1B(π_U(w), x[<U | B>])   if s_1B(π_U(w), x[<U | B>]) = s_2B(π_V(w), x[<V | B>]),
               undefined                  otherwise.

□


The  first  result  concerning  the  cohesion  of  SHS  shows  that  the  cohesion  operator 
preserves  SHS  representability. 

THEOREM 3.5: Let H_1 and H_2 be SHS. Then

(*) VSEQ(H_1 © H_2) = VSEQ(H_1) © VSEQ(H_2).

Furthermore, if H_1 and H_2 are history-bounded-SHS representable, then so is H_1 © H_2. □

We now turn our attention to a question about “minimum representations” with
respect to cohesion. To motivate this idea, let H_1 and H_2 be SHS over <U> and <V>
respectively. Suppose a collection of spreadsheet histories 𝒰 ⊆ VSEQ(H_1) is maintained on
computer one and 𝒱 ⊆ VSEQ(H_2) is maintained on computer two. To calculate 𝒰 © 𝒱 on
computer one, 𝒱 must be transmitted from computer two via some communication channel. To
reduce the use of the channel, only histories in 𝒱 which will participate in the cohesion
should be transmitted. It is easily seen that π_V(𝒰 © 𝒱) is exactly the set which should be
sent. However, at site two the contents of 𝒰 cannot be known without prior
communication. Barring the existence of information about 𝒰 at computer two, the best that
can be done is to send only those histories in 𝒱 which can participate in at least one
cohesion with some history in VSEQ(H_1).

Ideally, we would like to find an SHS H_2' such that VSEQ(H_1) © VSEQ(H_2') =
VSEQ(H_1) © VSEQ(H_2) and VSEQ(H_2') is a minimum with respect to containment in
VSEQ(H_2). (Of course, it would have to be established that such an SHS exists.) The
scheme H_2' could then be used as a filter to possibly reduce the number of histories sent to
computer one (i.e., only histories in 𝒱 ∩ VSEQ(H_2') need be sent).

If such an H_2' exists, then a corresponding minimal H_1' might also exist by analogy.
If H_1' and H_2' are minimal with respect to each other, then we call the pair a minimum
representation of H_1 and H_2. More formally:

DEFINITION: Let H_1 be an SHS over <U> and H_2 an SHS over <V>. An ordered pair
(H_1', H_2') of SHS is called a minimum representation of (H_1, H_2) (with respect to cohesion) if

1) VSEQ(H_1) © VSEQ(H_2) = VSEQ(H_1') © VSEQ(H_2'), and

2) VSEQ(H_1') ⊆ VSEQ(H_1'') and VSEQ(H_2') ⊆ VSEQ(H_2'') for all H_1'' over <U> and
H_2'' over <V> such that VSEQ(H_1) © VSEQ(H_2) = VSEQ(H_1'') © VSEQ(H_2''). □

The  next  theorem  asserts  that  every  pair  of  SHS  has  a  minimum  representation. 

THEOREM 3.6: Let H_1 and H_2 be SHS over <U> and <V> respectively. Then (H_1,
H_2) has a minimum representation (H_1', H_2'). Furthermore, VSEQ(H_1') = π_U(VSEQ(H_1) ©
VSEQ(H_2)) and VSEQ(H_2') = π_V(VSEQ(H_1) © VSEQ(H_2)). □

Suppose both H_1 and H_2 are history-bounded-SHS representable. From Theorem
3.5, we know that H_1 © H_2 is also history-bounded-SHS representable. The question
arises: if (H_1', H_2') is a minimum representation of (H_1, H_2), then are H_1' and H_2' also
history-bounded-SHS representable? The answer is no in general; it is straightforward to
construct a counterexample.

Turning to intersection, note that for SHS H_1 and H_2 over the same <U>,
VSEQ(H_1) ∩ VSEQ(H_2) = VSEQ(H_1) © VSEQ(H_2). Thus, as a corollary to Theorem 3.5, we have:

PROPOSITION 3.7: Let H_1 and H_2 be SHS over <U>. Then VSEQ(H_1) ∩ VSEQ(H_2)
is SHS representable. Furthermore, if both H_1 and H_2 are history bounded, then
VSEQ(H_1) ∩ VSEQ(H_2) is history-bounded-SHS representable. □

The state of affairs for the union operator is not as good. Let H_1 and H_2 be SHS over
some <U>. In general, VSEQ(H_1) ∪ VSEQ(H_2) is not SHS representable because the
spreadsheet schemes may have disparate functions for their evaluation attributes.
However, if the spreadsheet functions in the two SHS are compatible with each other then the
union may be SHS representable.

DEFINITION: Let H_1 and H_2 be SHS over <U>. Then H_1 and H_2 are said to be
compatible if there exists a spreadsheet scheme S over <U> such that VSEQ(H_1) ⊆
VSEQ(S) and VSEQ(H_2) ⊆ VSEQ(S). □

PROPOSITION 3.8: Let H_1 and H_2 be SHS over <U>. Then VSEQ(H_1) ∪ VSEQ(H_2)
is SHS representable if and only if H_1 and H_2 are compatible. □


4  Projection  Simulation 


In [GK 88], the question of when two SHS (or CSS) describe the same set of
histories was studied. The definition of “sameness”, called “projection simulation,” was based on
the computation-tuple-sequence analogue to the relational-database projection operator.
Under AFOSR support, this study was extended to the subclass of history-bounded SHS.
In this section, the formal definition of projection simulation is presented and then
examined with respect to history-bounded SHS.

Suppose we wish to implement a database application in which stock-transaction
histories are represented as computation-tuple sequences that are valid with respect to
the SHS H of Example 1.1. To uniquely identify a history we need to know its
initialization and its subsequent inputs. From this and the history scheme, we can derive the value
for each evaluation attribute of each tuple in the history. Storing just the initialization
and inputs is a space-efficient way to maintain the database. From a
computational-efficiency perspective, however, this may be a poor method. For each query based upon the
values of evaluation attributes, the database system would have to calculate these values.
Furthermore, each time an update is performed (i.e., a new tuple is added to a history) it
may be necessary to recalculate the values of the evaluation attributes for the entire
history. [An update is not valid unless all of the evaluation-attribute values in the new tuple
are defined. In general, these values depend on the previous evaluation-attribute values.]

Suppose it is known that 95% of the queries will be based on the input attributes
and the PROFIT evaluation attribute (the transaction profit). To strike a balance between
space and computation efficiency, we may want to store only the PROFIT
evaluation-attribute values. We could then process 95% of the queries without having to calculate the
other evaluation-attribute values. (The other 5% of the queries would still require these
calculations.) Ideally, we would like to find an SHS H' over <V> = <IxPROFIT> such that
the projection operator π_V maps VSEQ(H) one-to-one onto VSEQ(H'). We could then
maintain the database using the SHS H' and eliminate the need to calculate the values for the
evaluation attributes in E - {PROFIT} during an update.


In a sense, the two SHS H and H' would define the same set of stock-transaction
histories because each sequence in VSEQ(H) would correspond to a unique sequence in
VSEQ(H'), one comprised of the same inputs and a nonempty subset of the same
evaluation-attribute values. The roles played by the attributes in H' are identical to those played
by the same attributes in H. Intuitively speaking, because we have a one-to-one onto
mapping, the information lost by not maintaining the attributes in E - {PROFIT} is redundant
in that it is not necessary for discriminating between different histories.

We shall call the concept of sameness defined above projection simulation, or
p-simulation. Formally, we have:

DEFINITION: Let H_1 be an SHS over <U> = <IxE> and H_2 an SHS over <V> =
<IxE'>, with E' ⊆ E. H_2 projection simulates (p-simulates) H_1 if π_V maps VSEQ(H_1)
one-to-one onto VSEQ(H_2). □
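
On finite samples of histories the one-to-one onto condition can be tested directly. This is a minimal sketch, assuming histories are given explicitly as finite sets; the helper names and the attributes QTY, PRICE, COST and BIG are hypothetical, and a check on samples says nothing conclusive about the full (usually infinite) VSEQ sets.

```python
def row(**values):
    """A computation tuple as a hashable, attribute-sorted tuple of pairs."""
    return tuple(sorted(values.items()))


def project(seq, attrs):
    """pi_attrs(seq): keep only the attributes in attrs in every tuple."""
    return tuple(tuple((a, x) for a, x in r if a in attrs) for r in seq)


def maps_one_to_one_onto(source, target, v_attrs):
    """True if pi_V sends `source` injectively onto `target`."""
    images = [project(h, v_attrs) for h in source]
    one_to_one = len(set(images)) == len(set(source))
    return one_to_one and set(images) == set(target)


v_attrs = {"QTY", "PRICE", "COST"}       # the evaluation attribute BIG is dropped

# BIG is determined by the retained attributes: projection loses nothing.
good = {(row(QTY=2, PRICE=5, COST=10, BIG=0),),
        (row(QTY=3, PRICE=5, COST=15, BIG=1),)}
# Two histories differ only in the dropped attribute: they collapse.
bad = {(row(QTY=2, PRICE=5, COST=10, BIG=0),),
       (row(QTY=2, PRICE=5, COST=10, BIG=1),)}

print(maps_one_to_one_onto(good, {project(h, v_attrs) for h in good}, v_attrs))  # True
print(maps_one_to_one_onto(bad, {project(h, v_attrs) for h in bad}, v_attrs))    # False
```

The second case shows the kind of non-redundant information whose loss rules out p-simulation.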

We now present three separate conditions sufficient to ensure that p-simulation
preserves history boundedness. Each condition is stated in terms of a special type of data
dependency called a history-bounded dependency.

DEFINITION: Let <U> be a finite sequence of attributes and r a non-negative
integer. A history-bounded dependency over <U> is an ordered pair (X, Y), where XY ⊆ U. A
set 𝒰 ⊆ SEQ(<U>) is said to rank-r satisfy (X, Y), denoted 𝒰 ⊨_r (X, Y), if for each B in Y
and each pair of sequences u_1 ... u_{r+1} and v_1 ... v_{r+1} in Interval(𝒰),
π_X(u_1 ... u_r) = π_X(v_1 ... v_r) and³ π_{<X | B>}(u_{r+1}) = π_{<X | B>}(v_{r+1}) imply
u_{r+1}(B) = v_{r+1}(B). □

Note that if 𝒰 ⊨_r (X, Y), then 𝒰 ⊨_s (X, Y) for all s, s > r.

If 𝒰 ⊨_r (X, Y) for some r, then we sometimes write 𝒰 ⊨ (X, Y) when the particular
value of r is unimportant.

Let s_B be an r-history-bounded spreadsheet function defined from SEQ(<U>, r) ×
Dom(<U | B>) to Dom(B). It is readily seen that VSEQ(s_B) ⊨_r (U, B).

If 𝒱 ⊆ 𝒰 and 𝒰 ⊨_r (X, Y), then 𝒱 ⊨_r (X, Y). [Indeed, let u_1 ... u_{r+1} and
v_1 ... v_{r+1} be in Interval(𝒱). Since u_1 ... u_{r+1} and v_1 ... v_{r+1} are also in
Interval(𝒰), for each B in Y, π_X(u_1 ... u_r) = π_X(v_1 ... v_r) and
π_{<X | B>}(u_{r+1}) = π_{<X | B>}(v_{r+1}) imply that u_{r+1}(B) = v_{r+1}(B), i.e.,
𝒱 ⊨_r (X, Y).]

3. If B ≤ C for each C in X, then the value of the expression <X | B> is the empty sequence. In this case,
the expression π_{<X | B>}(u) is defined to be a special value, called the empty tuple, which has the property
that π_{<X | B>}(u) = π_{<X | B>}(v) for all tuples u and v.
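
Rank-r satisfaction can be tested mechanically on a finite set of histories. The sketch below rests on two assumptions that are not spelled out in this section: Interval(𝒰) is taken to be the set of contiguous subsequences of members of 𝒰, and <X | B> is taken to be the attributes of X that precede B in the attribute ordering. The function names and the running-total example (SUM is the previous SUM plus the input IN) are hypothetical.

```python
def intervals(histories, length):
    """All contiguous subsequences of the given length (assumed meaning of Interval)."""
    found = set()
    for h in histories:
        for i in range(len(h) - length + 1):
            found.add(h[i:i + length])
    return found


def rank_r_satisfies(histories, order, x, y, r):
    """Check whether the histories rank-r satisfy the dependency (x, y).

    Each tuple lists its values in the attribute order given by `order`.
    """
    idx = {a: i for i, a in enumerate(order)}

    def proj(tup, attrs):
        return tuple(tup[idx[a]] for a in order if a in attrs)

    windows = intervals(histories, r + 1)
    for b in y:
        before_b = {a for a in x if idx[a] < idx[b]}   # assumed meaning of <X | B>
        seen = {}
        for w in windows:
            # Agreement on X over the first r tuples and on <X | B> in the last one...
            key = (tuple(proj(t, x) for t in w[:r]), proj(w[r], before_b))
            value = w[r][idx[b]]
            # ...must force agreement on B in the last tuple.
            if seen.setdefault(key, value) != value:
                return False
    return True


order = ["IN", "SUM"]
histories = {((1, 1), (2, 3), (3, 6)), ((5, 5), (2, 7))}

print(rank_r_satisfies(histories, order, {"IN", "SUM"}, {"SUM"}, 1))  # True
print(rank_r_satisfies(histories, order, {"IN"}, {"SUM"}, 0))         # False
```

Because the check only looks for a conflicting pair of windows, removing histories can never turn a satisfied dependency into a violated one, which matches the subset property stated above.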

History-bounded  dependencies  are  similar  in  spirit  to  functional  dependencies  (FDs) 
[Co  72]  in  the  relational  database  model.  Analogous  to  the  rules  of  FD  inference  [DC  73; 
Arm  74;  Ma  83],  there  are  rules  for  inferring  new  history-bounded  dependencies  from  a 
given  set  of  history-bounded  dependencies.  The  following  proposition  states  several  such 
rules. 

PROPOSITION 4.1: Let X_1, X_2, Y_1 and Y_2 be non-empty subsets of U and 𝒰 ⊆
SEQ(<U>). Then

(a) 𝒰 ⊨_r (X_1, Y_1) implies 𝒰 ⊨_r (X_1X_2, Y_1);

(b) 𝒰 ⊨_r (X_1, Y_1) and 𝒰 ⊨_s (X_2, Y_2) imply 𝒰 ⊨_max(r,s) (X_1X_2, Y_1Y_2); and

(c) 𝒰 ⊨_r (X_1, Y_1) and 𝒰 ⊨_s (Y_1X_2, Y_2) imply 𝒰 ⊨_{r+s} (X_1X_2, Y_2).

□
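
The bookkeeping in these rules, in particular how the rank of the derived dependency is obtained, can be written out as follows. This is a sketch under assumptions: a dependency is modelled as a triple (X, Y, rank) of two frozensets and an integer, the function names are hypothetical, rule (b) is read as combining (X_1, Y_1) with (X_2, Y_2), and rule (c) as combining (X_1, Y_1) with (Y_1X_2, Y_2). The attribute names are illustrative.

```python
def augment(dep, extra):
    """Rule (a): enlarge the left-hand side; the rank is unchanged."""
    x, y, r = dep
    return (x | extra, y, r)


def combine(dep1, dep2):
    """Rule (b): union both sides; the rank is the larger of the two."""
    (x1, y1, r), (x2, y2, s) = dep1, dep2
    return (x1 | x2, y1 | y2, max(r, s))


def chain(dep1, dep2):
    """Rule (c): dep2's left-hand side is Y1 together with some X2; the ranks add."""
    (x1, y1, r), (lhs2, y2, s) = dep1, dep2
    x2 = lhs2 - y1
    if not y1 <= lhs2 or not x2:
        raise ValueError("second left-hand side must be Y1 plus a non-empty X2")
    return (x1 | x2, y2, r + s)


d1 = (frozenset({"IN"}), frozenset({"SUM"}), 1)
d2 = (frozenset({"SUM", "RATE"}), frozenset({"TAX"}), 2)

print(chain(d1, d2)[2])     # 3
print(combine(d1, d2)[2])   # 2
```

Only the symbolic side of the rules is modelled here; the sketch does not verify that any particular set of sequences satisfies the dependencies involved.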

THEOREM 4.2: Let H be a history-bounded SHS over <U> = <IxE>, ∅ ≠ Y ⊆ E, V =
U - Y, and VSEQ(H) ⊨ (V, Y). Then each SHS over <V> which p-simulates H is
history-bounded-SHS representable. □

In the proof of Theorem 4.2, the condition VSEQ(H) ⊨ (V, Y) guarantees that
history-bounded spreadsheet functions for the (E - Y)-attributes can be expressed solely in
terms of the V-attributes. In other words, VSEQ(H) ⊨ (V, Y) implies VSEQ(H) ⊨ (V, E -
Y). This last condition, VSEQ(H) ⊨ (V, E - Y), is necessary for H to be p-simulated by a
history-bounded SHS H' over <V>.

THEOREM 4.3: Let H = ((<I>, <E>, {s_B | B in E}), ℐ) be a history-bounded SHS over
<U>, ∅ ≠ Y ⊆ E and V = U - Y. If VSEQ(H) ⊨ (V, E - Y) and s_B is total for each B in Y,
then each SHS over <V> which p-simulates H is history-bounded-SHS representable. □

The final result of this section demonstrates the special nature of
zero-history-bounded spreadsheet functions.

THEOREM 4.4: Let H = ((<I>, <E>, S), ℐ) be a history-bounded SHS over <U>, ∅ ≠ E'
⊆ E, and <V> = <IxE'>. If s_B is zero-history bounded for each B in E - E', then there
exists a history-bounded SHS over <V> which p-simulates H. □


Speaking intuitively, Theorem 4.4 says that the roles played by zero-history-bounded
spreadsheet functions in an SHS may be subsumed by the other evaluation attributes
without affecting the history-boundedness of the SHS.